Methods and compositions for nucleic acid sequencing

ABSTRACT

Embodiments of the present invention relate to sequencing nucleic acids. In particular, embodiments of the methods and compositions provided herein relate to preparing nucleic acid templates and obtaining sequence data therefrom.

FIELD OF THE INVENTION

Embodiments of the present invention relate to sequencing nucleic acids. In particular, embodiments of the methods and compositions provided herein relate to preparing nucleic acid templates and obtaining sequence data therefrom.

BACKGROUND OF THE INVENTION

The detection of specific nucleic acid sequences present in a biological sample has been used, for example, as a method for identifying and classifying microorganisms, diagnosing infectious diseases, detecting and characterizing genetic abnormalities, identifying genetic changes associated with cancer, studying genetic susceptibility to disease, and measuring response to various types of treatment. A common technique for detecting specific nucleic acid sequences in a biological sample is nucleic acid sequencing.

Nucleic acid sequencing methodology has evolved significantly from the chemical degradation methods used by Maxam and Gilbert and the strand elongation methods used by Sanger. Today several sequencing methodologies are in use which allow for the parallel processing of nucleic acids all in a single sequencing run. As such, the information generated from a single sequencing run can be enormous.

SUMMARY OF THE INVENTION

Some embodiments of the methods and compositions provided herein include a method of obtaining sequence information from a target nucleic acid, said method comprising: (a) obtaining a template nucleic acid comprising a plurality of transposomes inserted into said target nucleic acid, wherein at least some of the inserted transposomes each comprise a first transposon sequence, a second transposon sequence noncontiguous with said first transposon sequence, and a transposase associated with the first transposon sequence and the second transposon sequence; (b) compartmentalizing the template nucleic acid comprising said plurality of inserted transposomes into each vessel of a plurality of vessels; (c) removing the transposase from the template nucleic acid; and (d) obtaining sequence information from the template nucleic acid of each vessel.

In some embodiments, step (b) comprises providing each vessel with an amount of template nucleic acid equal to less than about one haploid equivalent, about one haploid equivalent, or more than one haploid equivalent of the target nucleic acid.

In some embodiments, step (b) comprises providing each vessel with an amount of template nucleic acid less than about one haploid equivalent of the target nucleic acid.

In some embodiments, step (c) comprises a method selected from the group consisting of adding a detergent, changing temperature, changing pH, adding a protease, adding a chaperone, and adding a polymerase.

In some embodiments, the first transposon sequence comprises a first primer site and the second transposon sequence comprises a second primer site.

In some embodiments, the first primer site further comprises a first barcode and the second primer site further comprises a second barcode.

In some embodiments, the first barcode and second barcode are different.

In some embodiments, the target nucleic acid comprises an amplified nucleic acid.

In some embodiments, the target nucleic acid is obtained by enriching a plurality of nucleic acids for a sequence of interest before or after transposition.

In some embodiments, step (a) further comprises enriching the template nucleic acid for a sequence of interest.

In some embodiments, the target nucleic acid comprises genomic DNA.

In some embodiments, step (d) further comprises assembling from sequence data a representation of at least a portion of said template nucleic acid from each vessel.

In some embodiments, the sequence information comprises haplotype sequence information.

Some embodiments of the methods and compositions provided herein include a method for preparing a library of template nucleic acids to obtain sequence information from a target nucleic acid, said method comprising: (a) preparing a template nucleic acid comprising a plurality of transposomes inserted into said target nucleic acid, wherein at least some of the inserted transposomes each comprise a first transposon sequence, a second transposon sequence noncontiguous with said first transposon sequence, and a transposase associated with the first transposon sequence and the second transposon sequence; (b) compartmentalizing the template nucleic acid comprising said plurality of inserted transposomes into each vessel of a plurality of vessels; and (c) removing the transposase from the template nucleic acid.

In some embodiments, step (b) comprises providing each vessel with an amount of template nucleic acid equal to less than one haploid equivalent, about one haploid equivalent, or more than one haploid equivalent of the target nucleic acid.

In some embodiments, step (b) comprises providing each vessel with an amount of template nucleic acid less than about one haploid equivalent of the target nucleic acid.

In some embodiments, step (c) comprises a method selected from the group consisting of adding a detergent, changing temperature, changing pH, adding a protease, adding a chaperone, and adding a polymerase.

In some embodiments, the first transposon sequence comprises a first primer site and the second transposon sequence comprises a second primer site.

In some embodiments, the first primer site further comprises a first barcode and the second primer site further comprises a second barcode.

In some embodiments, the first barcode and second barcode are different.

In some embodiments, the target nucleic acid comprises an amplified nucleic acid.

In some embodiments, the target nucleic acid is obtained by enriching a plurality of nucleic acids for a sequence of interest.

In some embodiments, step (a) further comprises enriching the template nucleic acid for a sequence of interest.

In some embodiments, the target nucleic acid comprises genomic DNA.

In some embodiments, the sequence information comprises haplotype sequence information.

Some embodiments of the methods and compositions provided herein include a library of template nucleic acids prepared according to any one of the foregoing methods.

Some embodiments of the methods and compositions provided herein include a method of obtaining sequence information from a target nucleic acid, said method comprising: (a) compartmentalizing the target nucleic acid into a plurality of first vessels; (b) providing a first index to the target nucleic acid of each first vessel, thereby obtaining a first indexed nucleic acid; (c) combining the first indexed nucleic acids; (d) compartmentalizing the first indexed template nucleic acids into a plurality of second vessels; (e) providing a second index to the first indexed template nucleic acid of each second vessel, thereby obtaining a second indexed nucleic acid; and (f) obtaining sequence information from the second indexed nucleic acid of each second vessel.

In some embodiments, step (b) comprises contacting the target nucleic acid with a plurality of transposomes each comprising a transposase and a transposon sequence comprising the first index under conditions such that at least some of the transposon sequences insert into the target nucleic acid.

In some embodiments, step (b) comprises contacting the target nucleic acid with a plurality of transposomes, each transposome comprising a first transposon sequence comprising a first index, a second transposon sequence noncontiguous with said first transposon sequence, and a transposase associated with the first transposon sequence and the second transposon sequence.

In some embodiments, step (d) comprises removing the transposase from the compartmentalized first indexed template nucleic acids.

In some embodiments, the transposase is removed subsequent to step (b).

In some embodiments, the transposase is removed prior to step (f).

In some embodiments, removing the transposase comprises a method selected from the group consisting of adding a detergent, changing temperature, changing pH, adding a protease, adding a chaperone, and adding a strand-displacing polymerase.

In some embodiments, the first transposon sequence comprises a first primer site and the second transposon sequence comprises a second primer site.

In some embodiments, the first primer site further comprises a first barcode and the second primer site further comprises a second barcode.

In some embodiments, the first barcode and second barcode are different.

In some embodiments, step (b) comprises amplifying the target nucleic acid with at least one primer comprising the first index.

In some embodiments, step (b) comprises ligating the target nucleic acid with at least one primer comprising the first index.

In some embodiments, the first index provided to the target nucleic acid of each first vessel is different.

In some embodiments, step (a) comprises providing each first vessel with an amount of target nucleic acid greater than about one or more haploid equivalents of the target nucleic acid.

In some embodiments, step (d) comprises providing each vessel with an amount of the first indexed template nucleic acids greater than about one or more haploid equivalents of the target nucleic acid.

In some embodiments, step (e) comprises amplifying the first indexed template nucleic acid with at least one primer comprising the second index.

In some embodiments, step (e) comprises ligating the first indexed template nucleic acid with at least one primer comprising the second index.

In some embodiments, the second index provided to the first indexed template nucleic acid of each second vessel is different.

In some embodiments, the target nucleic acid comprises an amplified nucleic acid.

In some embodiments, the target nucleic acid is obtained by enriching a plurality of nucleic acids for a sequence of interest.

In some embodiments, the target nucleic acid comprises genomic DNA.

In some embodiments, step (f) further comprises assembling from sequence data a representation of at least a portion of said template nucleic acid from each vessel.

Some embodiments of the methods and compositions provided herein include a method of preparing a library of template nucleic acids to obtain sequence information from a target nucleic acid, said method comprising: (a) compartmentalizing the target nucleic acid into a plurality of first vessels; (b) providing a first index to the target nucleic acid of each first vessel, thereby obtaining a first indexed nucleic acid; (c) combining the first indexed nucleic acids; (d) compartmentalizing the first indexed template nucleic acids into a plurality of second vessels; and (e) providing a second index to the first indexed template nucleic acid of each second vessel, thereby obtaining a second indexed nucleic acid.

In some embodiments, step (b) comprises contacting the target nucleic acid with a plurality of transposomes each comprising a transposase and a transposon sequence comprising the first index under conditions such that at least some of the transposon sequences insert into the target nucleic acid.

In some embodiments, step (b) comprises contacting the target nucleic acid with a plurality of transposomes, each transposome comprising a first transposon sequence comprising a first index, a second transposon sequence noncontiguous with said first transposon sequence, and a transposase associated with the first transposon sequence and the second transposon sequence.

In some embodiments, step (d) comprises removing the transposase from the compartmentalized first indexed template nucleic acids.

In some embodiments, removing the transposase comprises a method selected from the group consisting of adding a detergent, changing temperature, changing pH, adding a protease, adding a chaperone, and adding a polymerase.

In some embodiments, the first transposon sequence comprises a first primer site and the second transposon sequence comprises a second primer site.

In some embodiments, the first primer site further comprises a first barcode and the second primer site further comprises a second barcode.

In some embodiments, the first barcode and second barcode are different.

In some embodiments, step (b) comprises amplifying the target nucleic acid with at least one primer comprising the first index.

In some embodiments, step (b) comprises ligating the target nucleic acid with at least one primer comprising the first index.

In some embodiments, the first index provided to the target nucleic acid of each first vessel is different.

In some embodiments, step (a) comprises providing each first vessel with an amount of target nucleic acid greater than about one or more haploid equivalents of the target nucleic acid.

In some embodiments, step (d) comprises providing each vessel with an amount of the first indexed template nucleic acids greater than about one or more haploid equivalents of the target nucleic acid.

In some embodiments, step (e) comprises amplifying the first indexed template nucleic acid with at least one primer comprising the second index.

In some embodiments, step (e) comprises ligating the first indexed template nucleic acid with at least one primer comprising the second index.

In some embodiments, the second index provided to the first indexed template nucleic acid of each second vessel is different.

In some embodiments, the target nucleic acid comprises an amplified nucleic acid.

In some embodiments, the target nucleic acid is obtained by enriching a plurality of nucleic acids for a sequence of interest either before or after transposition.

In some embodiments, the target nucleic acid comprises genomic DNA.

Some embodiments of the methods and compositions provided herein include a library of template nucleic acids prepared according to any one of the foregoing methods.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a schematic of a transposome comprising a dimeric transposase and two non-contiguous transposon sequences, and a transposome comprising a dimeric transposase and a contiguous transposon sequence.

FIG. 2 depicts a method of preparing a transposome with a linker comprising a complementary double-stranded sequence.

FIG. 3 depicts an embodiment of making a template library using transposomes comprising transposon sequences comprising a single-stranded linker coupling the two transposon sequences in each transposome in a 5′-5′ orientation. Sequences are extended using primers.

FIG. 4 depicts a scheme for preparing template nucleic acids to obtain sequence information in which a target nucleic acid is compartmentalized into 96 tubes and indexed by insertion of Tn5-derived transposons; the indexed nucleic acids are combined, further compartmentalized into 96 tubes, and further indexed by amplification. The twice-indexed nucleic acids may then be combined.

FIG. 5 depicts a schematic embodiment for obtaining haplotype sequence information in which a template nucleic acid is indexed with the barcode of a transposon, and with a primer. A template nucleic acid is prepared by insertion of a looped transposon into a target nucleic acid. The template nucleic acid is diluted into compartments. The template nucleic acid of each compartment is indexed by amplification with a primer. Indexed template nucleic acids are sequenced and aligned, and a sequence representation is obtained.

FIG. 6 depicts a scheme that includes preparing a target nucleic acid using matepair and rolling circle amplification, followed by insertion of transposomes into the target nucleic acid, dilution of the target nucleic acid to obtain haplotype information, removal of the transposase by addition of SDS, generation of shotgun libraries, indexing and sequencing.

FIG. 7 depicts a scheme that includes preparing a target nucleic acid using hairpin transposition and rolling circle amplification, followed by insertion of transposomes into the target nucleic acid, dilution of the target nucleic acid, removal of the transposase by addition of SDS, generation of shotgun libraries, indexing and sequencing to obtain haplotype information.

FIG. 8 depicts an example scheme for generation of mate pair libraries.

FIG. 9 depicts an example scheme for generation of mate pair libraries.

FIG. 10 is a graph depicting a model of error rate in sequence information for the number of times a particular sequence associated with a barcode is sequenced.

FIG. 11 depicts images of agarose gels showing oligonucleotides linked with 5′-5′ bisoxyamine coupling, in which looped precursor transposons are indicated by the dimer band.

FIG. 12 is an image of an agarose gel showing the apparent molecular weight of a transposed target nucleic acid associated with transposase (left lane), and without transposase (+0.1% SDS, middle lane).

FIG. 13 summarizes that haplotype blocks up to 100 kb were observed for samples in which transposase was removed by SDS post-dilution.

FIG. 14 depicts a graph showing the frequencies of sequencing reads for particular distances between neighboring aligned reads for template nucleic acids prepared by adding SDS to remove transposase before dilution to obtain haplotype information, or after dilution to obtain haplotype information.

FIG. 15 shows a graph of barcode indices and proportion of reads and demonstrates that all 9216 different compartments in a 96×96 indexing scheme were observed.

FIG. 16 depicts a pile-up analysis of haplotype information obtained using transposomes comprising Mu.

DETAILED DESCRIPTION

Embodiments of the present invention relate to sequencing nucleic acids. In particular, embodiments of the methods and compositions provided herein relate to preparing nucleic acid templates and obtaining sequence data therefrom. Methods and compositions provided herein are related to the methods and compositions provided in U.S. Patent Application Pub. No. 2012/0208705, U.S. Patent Application Pub. No. 2012/0208724 and Int. Patent Application Pub. No. WO 2012/061832, each of which is incorporated by reference in its entirety. Some embodiments of the present invention relate to preparing templates to obtain haplotype sequence information from a target nucleic acid, and obtaining haplotype sequence information from such templates. More embodiments relate to preparing templates to obtain sequence information from a strand of a double-stranded target nucleic acid, and obtaining sequence information from such templates. Particular embodiments provided herein relate to the use of integrases, for example transposases, to maintain physical proximity of associated ends of fragmented nucleic acids; and to the use of virtual compartments to enable the use of high concentrations of nucleic acids.

Obtaining haplotype information from a target nucleic acid includes distinguishing between different alleles (e.g., SNPs, genetic anomalies, etc.) in a target nucleic acid. Such methods are useful to characterize different alleles in a target nucleic acid, and to reduce the error rate in sequence information. Generally, methods to obtain haplotype sequence information include obtaining sequence information for a portion of a template nucleic acid. In one embodiment, a template nucleic acid can be diluted and sequence information obtained from an amount of template nucleic acid equivalent to about a haplotype of the target nucleic acid.
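By way of non-limiting illustration, the effect of dilution on haplotype resolution can be sketched numerically. The following Python sketch is not part of the disclosed methods; it assumes a simple model, introduced here only for illustration, in which the copies of a given locus from each of two haplotypes are distributed into a compartment independently and follow a Poisson distribution. The function name `mixed_haplotype_fraction` and its parameter are hypothetical.

```python
import math


def mixed_haplotype_fraction(haploid_equivalents: float) -> float:
    """Estimate how often a compartment holding a locus holds both haplotypes.

    Illustrative assumption: a diploid sample, with copies of a locus from
    each haplotype arriving independently as Poisson counts whose combined
    mean equals the haploid equivalents loaded per compartment.
    """
    if haploid_equivalents <= 0:
        raise ValueError("haploid_equivalents must be positive")
    per_haplotype_mean = haploid_equivalents / 2.0
    # Probability that at least one copy of a given haplotype is present.
    p_one_haplotype = -math.expm1(-per_haplotype_mean)
    # Probability that at least one copy of either haplotype is present.
    p_any_copy = -math.expm1(-haploid_equivalents)
    # Condition on the locus being present in the compartment at all.
    return p_one_haplotype ** 2 / p_any_copy


if __name__ == "__main__":
    for amount in (0.1, 0.5, 1.0, 2.0):
        fraction = mixed_haplotype_fraction(amount)
        print(f"{amount:.1f} haploid equivalents: {fraction:.3f} mixed")
```

Under these assumptions the mixed fraction falls as the amount per compartment falls, which is consistent with the use of amounts of less than about one haploid equivalent when dilution alone is relied upon.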

In further embodiments, a template nucleic acid can be compartmentalized such that multiple copies of a chromosome can be present in the same compartment; as a result of the dual or multiple indexing provided herein, a haplotype can still be determined. In other words, a template nucleic acid can be prepared using virtual compartments. In such embodiments, a nucleic acid can be distributed between several first compartments, a first index provided to the nucleic acid of each compartment, the nucleic acids combined, the nucleic acid distributed between several second compartments, and a second index provided to the nucleic acid of each compartment. Advantageously, such indexing enables haplotype information to be obtained at higher concentrations of nucleic acid compared to the mere dilution of a nucleic acid in a single compartment to an amount equivalent to a haplotype of the nucleic acid.
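By way of non-limiting illustration, the bookkeeping behind virtual compartments can be sketched as follows. This Python sketch is an assumption-laden example rather than a description of any particular implementation: reads are represented as hypothetical `(first_index, second_index, sequence)` tuples, and each distinct pair of indices is treated as one virtual compartment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical read record: (first index, second index, read sequence).
Read = Tuple[int, int, str]


def virtual_compartment_count(n_first: int, n_second: int) -> int:
    """Number of distinct (first index, second index) combinations."""
    return n_first * n_second


def group_by_virtual_compartment(
    reads: Iterable[Read],
) -> Dict[Tuple[int, int], List[str]]:
    """Collect read sequences that share both indices."""
    groups: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for first_index, second_index, sequence in reads:
        groups[(first_index, second_index)].append(sequence)
    return dict(groups)


if __name__ == "__main__":
    # Two rounds of 96 physical vessels give 96 x 96 index combinations.
    print(virtual_compartment_count(96, 96))
    example_reads = [(0, 1, "ACGT"), (0, 1, "TTGA"), (5, 1, "GGCC")]
    for compartment, sequences in group_by_virtual_compartment(example_reads).items():
        print(compartment, sequences)
```

In this sketch, reads that share a second index but differ in first index fall into different virtual compartments even though they occupied the same second vessel, which is the property that allows more nucleic acid to be loaded per physical vessel.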

In some embodiments provided herein, template libraries are prepared using transposomes. In some such libraries, the target nucleic acid may be fragmented. Accordingly, some embodiments provided herein relate to methods for maintaining sequence information for the physical contiguity of adjacent fragments. Such methods include the use of integrases to maintain the association of template nucleic acid fragments adjacent in the target nucleic acid. Advantageously, such use of integrases to maintain physical proximity of fragmented nucleic acids increases the likelihood that fragmented nucleic acids from the same original molecule, e.g., chromosome, will occur in the same compartment.

Other embodiments provided herein relate to obtaining sequence information from each strand of a nucleic acid, which can be useful to reduce the error rate in sequencing information. Libraries of template nucleic acids for obtaining sequence information from each strand of a nucleic acid can be prepared such that each strand can be distinguished, and the products of each strand can also be distinguished.

Some of the methods provided herein include methods of analyzing nucleic acids. Such methods include preparing a library of template nucleic acids of a target nucleic acid, obtaining sequence data from the library of template nucleic acids, and assembling a sequence representation of the target nucleic acid from such sequence data.

Generally, the methods and compositions provided herein are related to the methods and compositions provided in U.S. Patent Application Pub. No. 2012/0208705, U.S. Patent Application Pub. No. 2012/0208724 and Int. Patent Application Pub. No. WO 2012/061832, each of which is incorporated by reference in its entirety. The methods provided herein relate to the use of transposomes useful to insert features into a target nucleic acid. Such features include fragmentation sites, primer sites, barcodes, affinity tags, reporter moieties, etc.

In a method useful with the embodiments provided herein, a library of template nucleic acids is prepared from a target nucleic acid. The library is prepared by inserting a plurality of unique barcodes throughout the target nucleic acid. In some embodiments, each barcode includes a first barcode sequence and a second barcode sequence, having a fragmentation site disposed therebetween. The first barcode sequence and second barcode sequence can be identified or designated to be paired with one another. The pairing can be informative so that a first barcode is associated with a second barcode. Advantageously, the paired barcode sequences can be used to assemble sequencing data from the library of template nucleic acids. For example, identifying a first template nucleic acid comprising a first barcode sequence and a second template nucleic acid comprising a second barcode sequence that is paired with the first indicates that the first and second template nucleic acids represent sequences adjacent to one another in a sequence representation of the target nucleic acid. Such methods can be used to assemble a sequence representation of a target nucleic acid de novo, without the requirement of a reference genome.
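By way of non-limiting illustration, the use of paired barcode sequences to order template nucleic acids can be sketched as follows. This Python sketch rests on simplifying assumptions made only for illustration: each fragment carries at most one barcode sequence at each end, every barcode is unique, the pairing table is known in advance, and the fragments form a single linear chain. The names `Fragment` and `order_fragments` are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class Fragment(NamedTuple):
    left_barcode: Optional[str]   # barcode sequence at the left end, None at a terminus
    sequence: str                 # sequence between the two barcode sequences
    right_barcode: Optional[str]  # barcode sequence at the right end, None at a terminus


def order_fragments(fragments: List[Fragment], pairs: Dict[str, str]) -> List[Fragment]:
    """Order fragments so that paired barcode sequences are adjacent.

    `pairs` maps a right-end barcode sequence to the left-end barcode
    sequence designated as its partner across a fragmentation site.
    """
    by_left = {f.left_barcode: f for f in fragments if f.left_barcode is not None}
    # Left barcodes that are the partner of some right barcode in the set.
    partnered = {pairs[f.right_barcode] for f in fragments if f.right_barcode in pairs}
    starts = [f for f in fragments if f.left_barcode not in partnered]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting fragment")
    ordered = [starts[0]]
    seen = {starts[0]}
    while True:
        right = ordered[-1].right_barcode
        following = by_left.get(pairs.get(right)) if right is not None else None
        if following is None or following in seen:
            break
        ordered.append(following)
        seen.add(following)
    return ordered


if __name__ == "__main__":
    pairs = {"b1R": "b1L", "b2R": "b2L"}
    shuffled = [
        Fragment("b2L", "GGGG", None),
        Fragment(None, "AAAA", "b1R"),
        Fragment("b1L", "CCCC", "b2R"),
    ]
    print("".join(f.sequence for f in order_fragments(shuffled, pairs)))
```

The sketch orders fragments without consulting a reference sequence, relying only on the barcode pairing; real data would also require handling of missing fragments, sequencing errors in barcodes, and branching.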

Definitions

As used herein the term “nucleic acid” and/or “oligonucleotide” and/or grammatical equivalents thereof can refer to at least two nucleotide monomers linked together. A nucleic acid can generally contain phosphodiester bonds; however, in some embodiments, nucleic acid analogs may have other types of backbones, comprising, for example, phosphoramide (Beaucage, et al., Tetrahedron, 49:1925 (1993); Letsinger, J. Org. Chem., 35:3800 (1970); Sprinzl, et al., Eur. J. Biochem., 81:579 (1977); Letsinger, et al., Nucl. Acids Res., 14:3487 (1986); Sawai, et al., Chem. Lett., 805 (1984); Letsinger, et al., J. Am. Chem. Soc., 110:4470 (1988); and Pauwels, et al., Chemica Scripta, 26:141 (1986)), phosphorothioate (Mag, et al., Nucleic Acids Res., 19:1437 (1991); and U.S. Pat. No. 5,644,048), phosphorodithioate (Briu, et al., J. Am. Chem. Soc., 111:2321 (1989)), O-methylphosphoroamidite linkages (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, Oxford University Press), and peptide nucleic acid backbones and linkages (see Egholm, J. Am. Chem. Soc., 114:1895 (1992); Meier, et al., Chem. Int. Ed. Engl., 31:1008 (1992); Nielsen, Nature, 365:566 (1993); Carlsson, et al., Nature, 380:207 (1996)).

Other analog nucleic acids include those with positive backbones (Denpcy, et al., Proc. Natl. Acad. Sci. USA, 92:6097 (1995)); non-ionic backbones (U.S. Pat. Nos. 5,386,023; 5,637,684; 5,602,240; 5,216,141; and 4,469,863; Kiedrowshi, et al., Angew. Chem. Intl. Ed. English, 30:423 (1991); Letsinger, et al., J. Am. Chem. Soc., 110:4470 (1988); Letsinger, et al., Nucleosides & Nucleotides, 13:1597 (1994); Chapters 2 and 3, ASC Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghui and P. Dan Cook; Mesmaeker, et al., Bioorganic & Medicinal Chem. Lett., 4:395 (1994); Jeffs, et al., J. Biomolecular NMR, 34:17 (1994); Tetrahedron Lett., 37:743 (1996)) and non-ribose backbones (U.S. Pat. Nos. 5,235,033 and 5,034,506, and Chapters 6 and 7, ASC Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghui and P. Dan Cook). Nucleic acids may also contain one or more carbocyclic sugars (see Jenkins, et al., Chem. Soc. Rev., (1995) pp. 169-176).

Modifications of the ribose-phosphate backbone may be done to facilitate the addition of additional moieties such as labels, or to increase the stability of such molecules under certain conditions. In addition, mixtures of naturally occurring nucleic acids and analogs can be made. Alternatively, mixtures of different nucleic acid analogs, and mixtures of naturally occurring nucleic acids and analogs may be made. The nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded and single stranded sequence. The nucleic acid may be DNA, for example, genomic or cDNA, RNA or a hybrid, from single cells, multiple cells, or from multiple species, as with metagenomic samples, such as from environmental samples, further from mixed samples for example mixed tissue samples or mixed samples for different individuals of the same species, disease samples such as cancer related nucleic acids, and the like. A nucleic acid can contain any combination of deoxyribo- and ribo-nucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, and base analogs such as nitropyrrole (including 3-nitropyrrole) and nitroindole (including 5-nitroindole), etc.

In some embodiments, a nucleic acid can include at least one promiscuous base. Promiscuous bases can base-pair with more than one different type of base. In some embodiments, a promiscuous base can base-pair with at least two different types of bases and no more than three different types of bases. An example of a promiscuous base includes inosine that may pair with adenine, thymine, or cytosine. Other examples include hypoxanthine, 5-nitroindole, acyclic 5-nitroindole, 4-nitropyrazole, 4-nitroimidazole and 3-nitropyrrole (Loakes et al., Nucleic Acid Res. 22:4039 (1994); Van Aerschot et al., Nucleic Acid Res. 23:4363 (1995); Nichols et al., Nature 369:492 (1994); Bergstrom et al., Nucleic Acid Res. 25:1935 (1997); Loakes et al., Nucleic Acid Res. 23:2361 (1995); Loakes et al., J. Mol. Biol. 270:426 (1997); and Fotin et al., Nucleic Acid Res. 26:1515 (1998)). Promiscuous bases that can base-pair with at least three, four or more types of bases can also be used.

As used herein, the term “nucleotide analog” and/or grammatical equivalents thereof can refer to synthetic analogs having modified nucleotide base portions, modified pentose portions, and/or modified phosphate portions, and, in the case of polynucleotides, modified internucleotide linkages, as generally described elsewhere (e.g., Scheit, Nucleotide Analogs, John Wiley, New York, 1980; Englisch, Angew. Chem. Int. Ed. Engl. 30:613-29, 1991; Agarwal, Protocols for Polynucleotides and Analogs, Humana Press, 1994; and S. Verma and F. Eckstein, Ann. Rev. Biochem. 67:99-134, 1998). Generally, modified phosphate portions comprise analogs of phosphate wherein the phosphorus atom is in the +5 oxidation state and one or more of the oxygen atoms is replaced with a non-oxygen moiety, e.g., sulfur. Exemplary phosphate analogs include but are not limited to phosphorothioate, phosphorodithioate, phosphoroselenoate, phosphorodiselenoate, phosphoroanilothioate, phosphoranilidate, phosphoramidate, boronophosphates, including associated counterions, e.g., H⁺, NH₄⁺, Na⁺, if such counterions are present. Example modified nucleotide base portions include but are not limited to 5-methylcytosine (5mC); C-5-propynyl analogs, including but not limited to, C-5 propynyl-C and C-5 propynyl-U; 2,6-diaminopurine (also known as 2-amino adenine or 2-amino-dA); hypoxanthine, pseudouridine, 2-thiopyrimidine, isocytosine (isoC), 5-methyl isoC, and isoguanine (isoG; see, e.g., U.S. Pat. No. 5,432,272). Exemplary modified pentose portions include but are not limited to, locked nucleic acid (LNA) analogs including without limitation Bz-A-LNA, 5-Me-Bz-C-LNA, dmf-G-LNA, and T-LNA (see, e.g., The Glen Report, 16(2):5, 2003; Koshkin et al., Tetrahedron 54:3607-30, 1998), and 2′- or 3′-modifications where the 2′- or 3′-position is hydrogen, hydroxy, alkoxy (e.g., methoxy, ethoxy, allyloxy, isopropoxy, butoxy, isobutoxy and phenoxy), azido, amino, alkylamino, fluoro, chloro, or bromo.
Modified internucleotide linkages include phosphate analogs, analogs having achiral and uncharged intersubunit linkages (e.g., Sterchak, E. P. et al., Organic Chem., 52:4202, 1987), and uncharged morpholino-based polymers having achiral intersubunit linkages (see, e.g., U.S. Pat. No. 5,034,506). Some internucleotide linkage analogs include morpholidate, acetal, and polyamide-linked heterocycles. In one class of nucleotide analogs, known as peptide nucleic acids, including pseudocomplementary peptide nucleic acids (“PNA”), a conventional sugar and internucleotide linkage has been replaced with a 2-aminoethylglycine amide backbone polymer (see, e.g., Nielsen et al., Science, 254:1497-1500, 1991; Egholm et al., J. Am. Chem. Soc., 114:1895-1897, 1992; Demidov et al., Proc. Natl. Acad. Sci. 99:5953-58, 2002; Peptide Nucleic Acids: Protocols and Applications, Nielsen, ed., Horizon Bioscience, 2004).

As used herein, the term “sequencing read” and/or grammaticalequivalents thereof can refer to a repetitive process of physical orchemical steps that is carried out to obtain signals indicative of theorder of monomers in a polymer. The signals can be indicative of anorder of monomers at single monomer resolution or lower resolution. Inparticular embodiments, the steps can be initiated on a nucleic acidtarget and carried out to obtain signals indicative of the order ofbases in the nucleic acid target. The process can be carried out to itstypical completion, which is usually defined by the point at whichsignals from the process can no longer distinguish bases of the targetwith a reasonable level of certainty. If desired, completion can occurearlier, for example, once a desired amount of sequence information hasbeen obtained. A sequencing read can be carried out on a single targetnucleic acid molecule or simultaneously on a population of targetnucleic acid molecules having the same sequence, or simultaneously on apopulation of target nucleic acids having different sequences. In someembodiments, a sequencing read is terminated when signals are no longerobtained from one or more target nucleic acid molecules from whichsignal acquisition was initiated. For example, a sequencing read can beinitiated for one or more target nucleic acid molecules that are presenton a solid phase substrate and terminated upon removal of the one ormore target nucleic acid molecules from the substrate. Sequencing can beterminated by otherwise ceasing detection of the target nucleic acidsthat were present on the substrate when the sequencing run wasinitiated.

As used herein, the term “sequencing representation” and/or grammatical equivalents thereof can refer to information that signifies the order and type of monomeric units in the polymer. For example, the information can indicate the order and type of nucleotides in a nucleic acid. The information can be in any of a variety of formats including, for example, a depiction, image, electronic medium, series of symbols, series of numbers, series of letters, series of colors, etc. The information can be at single monomer resolution or at lower resolution. An exemplary polymer is a nucleic acid, such as DNA or RNA, having nucleotide units. A series of “A,” “T,” “G,” and “C” letters is a well-known sequence representation for DNA that can be correlated, at single nucleotide resolution, with the actual sequence of a DNA molecule. Other exemplary polymers are proteins having amino acid units and polysaccharides having saccharide units.

As used herein, the term “at least a portion” and/or grammatical equivalents thereof can refer to any fraction of a whole amount. For example, “at least a portion” can refer to at least about 1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, 99%, 99.9% or 100% of a whole amount.

Transposomes

A “transposome” comprises an integration enzyme such as an integrase or transposase, and a nucleic acid comprising an integration recognition site, such as a transposase recognition site. In embodiments provided herein, the transposase can form a functional complex with a transposase recognition site that is capable of catalyzing a transposition reaction. The transposase may bind to the transposase recognition site and insert the transposase recognition site into a target nucleic acid in a process sometimes termed “tagmentation”. In some such insertion events, one strand of the transposase recognition site may be transferred into the target nucleic acid. FIG. 1 depicts two examples of transposomes. In one example, a transposome (10) comprises a dimeric transposase comprising two subunits (20), and two non-contiguous transposon sequences (30). In another example, a transposome (50) comprises a dimeric transposase comprising two subunits (60), and a contiguous transposon sequence (70).

Some embodiments can include the use of a hyperactive Tn5 transposase and a Tn5-type transposase recognition site (Goryshin and Reznikoff, J. Biol. Chem., 273:7367 (1998)), or MuA transposase and a Mu transposase recognition site comprising R1 and R2 end sequences (Mizuuchi, K., Cell, 35: 785, 1983; Savilahti, H, et al., EMBO J., 14: 4893, 1995). ME sequences can also be used as optimized by a skilled artisan.

More examples of transposition systems that can be used with certain embodiments of the compositions and methods provided herein include Staphylococcus aureus Tn552 (Colegio et al., J. Bacteriol., 183: 2384-8, 2001; Kirby C et al., Mol. Microbiol., 43: 173-86, 2002), Ty1 (Devine & Boeke, Nucleic Acids Res., 22: 3765-72, 1994 and International Publication WO 95/23875), Transposon Tn7 (Craig, N L, Science. 271: 1512, 1996; Craig, N L, Review in: Curr Top Microbiol Immunol., 204:27-48, 1996), Tn10 and IS10 (Kleckner N, et al., Curr Top Microbiol Immunol., 204:49-82, 1996), Mariner transposase (Lampe D J, et al., EMBO J., 15: 5470-9, 1996), Tc1 (Plasterk R H, Curr. Topics Microbiol. Immunol., 204: 125-43, 1996), P Element (Gloor, G B, Methods Mol. Biol., 260: 97-114, 2004), Tn3 (Ichikawa & Ohtsubo, J Biol. Chem. 265:18829-32, 1990), bacterial insertion sequences (Ohtsubo & Sekine, Curr. Top. Microbiol. Immunol. 204: 1-26, 1996), retroviruses (Brown, et al., Proc Natl Acad Sci USA, 86:2525-9, 1989), and retrotransposon of yeast (Boeke & Corces, Annu Rev Microbiol. 43:403-34, 1989). More examples include IS5, Tn10, Tn903, IS911, and engineered versions of transposase family enzymes (Zhang et al., (2009) PLoS Genet. 5:e1000689. Epub 2009 Oct. 16; Wilson C. et al (2007) J. Microbiol. Methods 71:332-5).

More examples of integrases that may be used with the methods and compositions provided herein include retroviral integrases and integrase recognition sequences for such retroviral integrases, such as integrases from HIV-1, HIV-2, SIV, PFV-1, and RSV.

Transposon Sequences

Some embodiments of the compositions and methods provided herein include transposon sequences. In some embodiments, a transposon sequence includes at least one transposase recognition site. In some embodiments, a transposon sequence includes at least one transposase recognition site and at least one barcode. Transposon sequences useful with the methods and compositions provided herein are provided in U.S. Patent Application Pub. No. 2012/0208705, U.S. Patent Application Pub. No. 2012/0208724 and Int. Patent Application Pub. No. WO 2012/061832, each of which is incorporated by reference in its entirety. In some embodiments, a transposon sequence includes a first transposase recognition site, a second transposase recognition site, and a barcode or barcodes disposed therebetween.

Transposomes with Non-Contiguous Transposon Sequences

Some transposomes provided herein include a transposase comprising two transposon sequences. In some such embodiments, the two transposon sequences are not linked to one another; in other words, the transposon sequences are non-contiguous with one another. Examples of such transposomes are known in the art, see e.g., U.S. Patent Application Pub. No. 2010/0120098, the disclosure of which is incorporated herein by reference in its entirety. FIG. 1 depicts an example transposome (10) comprising a dimeric transposase (20) and two transposon sequences (30).

Looped Structures

In some embodiments, a transposome comprises a transposon sequence nucleic acid that binds two transposase subunits to form a “looped complex” or a “looped transposome,” that is, in essence, a transposase complexed with contiguous transposons. FIG. 1 depicts an example transposome (50) comprising a dimeric transposase (60) and a transposon sequence (70). Looped complexes can ensure that transposons are inserted into target DNA while maintaining ordering information of the original target DNA and without fragmenting the target DNA. As will be appreciated, looped structures may insert primers, barcodes, indexes and the like into a target nucleic acid, while maintaining physical connectivity of the target nucleic acid. In some embodiments, the transposon sequence of a looped transposome can include a fragmentation site such that the transposon sequence can be fragmented to create a transposome comprising two transposon sequences. Such transposomes are useful for ensuring that neighboring target DNA fragments, in which the transposons insert, receive code combinations that can be unambiguously assembled at a later stage of the assay.

Barcodes

Generally, a barcode can include one or more nucleotide sequences that can be used to identify one or more particular nucleic acids. The barcode can be an artificial sequence, or can be a naturally occurring sequence generated during transposition, such as identical flanking genomic DNA sequences (g-codes) at the end of formerly juxtaposed DNA fragments. A barcode can comprise at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more consecutive nucleotides. In some embodiments, a barcode comprises at least about 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 or more consecutive nucleotides. In some embodiments, at least a portion of the barcodes in a population of nucleic acids comprising barcodes is different. In some embodiments, at least about 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 99% of the barcodes are different. In more such embodiments, all of the barcodes are different. The diversity of different barcodes in a population of nucleic acids comprising barcodes can be randomly generated or non-randomly generated.

In some embodiments, a transposon sequence comprises at least one barcode. In some embodiments, such as transposomes comprising two non-contiguous transposon sequences, the first transposon sequence comprises a first barcode, and the second transposon sequence comprises a second barcode. In some embodiments, such as in looped transposomes, a transposon sequence comprises a barcode comprising a first barcode sequence and a second barcode sequence. In some of the foregoing embodiments, the first barcode sequence can be identified or designated to be paired with the second barcode sequence. For example, a known first barcode sequence can be known to be paired with a known second barcode sequence using a reference table comprising a plurality of first and second barcode sequences known to be paired to one another.
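By way of illustration only, the reference-table pairing described above can be sketched in Python as a simple lookup. The barcode strings, the table name PAIRED_BARCODES, and the function names are hypothetical and are not taken from the disclosure.

```python
# Hypothetical reference table: each known first barcode sequence maps to
# the second barcode sequence it is designated to be paired with.
PAIRED_BARCODES = {
    "ACGTAC": "TTGCAA",
    "GGATCC": "CATGAC",
}


def partner_of(first_barcode, table=PAIRED_BARCODES):
    """Return the second barcode paired with first_barcode, or None if unknown."""
    return table.get(first_barcode)


def are_paired(first_barcode, second_barcode, table=PAIRED_BARCODES):
    """True only if the table lists second_barcode as the partner of first_barcode."""
    return first_barcode in table and table[first_barcode] == second_barcode
```

A lookup of this kind keeps the informational association between two barcode sequences even where the sequences themselves share no sequence relationship.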

In another example, the first barcode sequence can comprise the same sequence as the second barcode sequence. In another example, the first barcode sequence can comprise the reverse complement of the second barcode sequence. In some embodiments, the first barcode sequence and the second barcode sequence are different. The first and second barcode sequences may comprise a bi-code.
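As a minimal sketch, assuming barcodes are written as strings over the letters A, C, G and T, the relationship between a first and a second barcode sequence could be classified as follows. The function names and the three labels are illustrative assumptions only.

```python
# Translation table for complementing DNA bases (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (case-insensitive input)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def pairing_relation(first, second):
    """Classify how a first barcode relates to a second barcode.

    Returns "identical", "reverse_complement", or "different".
    A palindromic barcode compared with itself is reported as "identical".
    """
    first, second = first.upper(), second.upper()
    if first == second:
        return "identical"
    if reverse_complement(first) == second:
        return "reverse_complement"
    return "different"
```

Barcodes classified as "different" would still be pairable through a reference table of known pairs.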

In some embodiments of compositions and methods described herein, barcodes are used in the preparation of template nucleic acids. As will be understood, the vast number of available barcodes permits each template nucleic acid molecule to comprise a unique identification. Unique identification of each molecule in a mixture of template nucleic acids can be used in several applications. For example, uniquely identified molecules can be applied to identify individual nucleic acid molecules in samples having multiple chromosomes, in genomes, in cells, in cell types, in cell disease states, and in species, for example, in haplotype sequencing, in parental allele discrimination, in metagenomic sequencing, and in sample sequencing of a genome.

Linkers

Some embodiments comprising looped transposomes where a transposase is complexed with contiguous transposons include transposon sequences comprising a first barcode sequence and a second barcode sequence having a linker disposed therebetween. In other embodiments, the linker can be absent, or can be the sugar-phosphate backbone that connects one nucleotide to another. The linker can comprise, for example, one or more of a nucleotide, a nucleic acid, a non-nucleotide chemical moiety, a nucleotide analogue, amino acid, peptide, polypeptide, or protein. In preferred embodiments, a linker comprises a nucleic acid. The linker can comprise at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more nucleotides. In some embodiments, a linker can comprise at least about 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, 400, 500 or more nucleotides.

In some embodiments, a linker can be amplifiable, for example by PCR, rolling circle amplification, strand displacement amplification, and the like. In other embodiments, a linker can comprise non-amplifiable moieties. Examples of non-amplifiable linkers include organic chemical linkers such as alkyl, propyl, PEG; non-natural bases such as isoC, isoG; or any group that does not amplify in DNA-based amplification schemes. For example, transposons containing isoC, isoG pairs can be amplified with dNTP mixtures lacking a complementary isoG and isoC, ensuring that no amplification occurs across the inserted transposons.

In some embodiments, the linker comprises a single-stranded nucleic acid. In some embodiments, the linker couples transposon sequences in a 5′-3′ orientation, a 5′-5′ orientation, or a 3′-3′ orientation.

Fragmentation Sites

In some embodiments comprising looped transposomes, the linker can comprise a fragmentation site. A fragmentation site can be used to cleave the physical, but not the informational, association between a first barcode sequence and a second barcode sequence. Cleavage may be by biochemical, chemical or other means. In some embodiments, a fragmentation site can include a nucleotide or nucleotide sequence that may be fragmented by various means. For example, a fragmentation site may comprise a restriction endonuclease site; at least one ribonucleotide cleavable with an RNAse; nucleotide analogues cleavable in the presence of a certain chemical agent; a diol linkage cleavable by treatment with periodate; a disulphide group cleavable with a chemical reducing agent; a cleavable moiety that may be subject to photochemical cleavage; and a peptide cleavable by a peptidase enzyme or other suitable means. See e.g., U.S. Patent Application Pub. No. 2012/0208705, U.S. Patent Application Pub. No. 2012/0208724 and Int. Patent Application Pub. No. WO 2012/061832, each of which is incorporated by reference in its entirety.

Primer Sites

In some embodiments, a transposon sequence can include a “sequencing adaptor” or “sequencing adaptor site”, that is to say a region that comprises one or more sites that can hybridize to a primer. In some embodiments, a transposon sequence can include at least a first primer site useful for amplification, sequencing, and the like. In some embodiments comprising looped transposomes, a linker can include a sequencing adaptor. In more embodiments comprising looped transposomes, a linker comprises at least a first primer site and a second primer site. The orientation of the primer sites in such embodiments can be such that a primer hybridizing to the first primer site and a primer hybridizing to the second primer site are in the same orientation, or in different orientations.

In some embodiments, a linker can include a first primer site and a second primer site having a non-amplifiable site disposed therebetween. The non-amplifiable site is useful to block extension of a polynucleotide strand between the first and second primer sites, wherein the polynucleotide strand hybridizes to one of the primer sites. The non-amplifiable site can also be useful to prevent concatamers. Examples of non-amplifiable sites include a nucleotide analogue, non-nucleotide chemical moiety, amino-acid, peptide, and polypeptide. In some embodiments, a non-amplifiable site comprises a nucleotide analogue that does not significantly base-pair with A, C, G or T. Some embodiments include a linker comprising a first primer site and a second primer site having a fragmentation site disposed therebetween. Other embodiments can use a forked or Y-shaped adapter design useful for directional sequencing, as described in U.S. Pat. No. 7,741,463, the disclosure of which is incorporated herein by reference in its entirety.

Affinity Tags

In some embodiments, a transposon sequence or transposase can include an affinity tag. In some embodiments comprising looped transposomes, a linker can comprise an affinity tag. Affinity tags can be useful for a variety of applications, for example the bulk separation of target nucleic acids hybridized to hybridization tags. Additional applications include, but are not limited to, using affinity tags for purifying transposase/transposon complexes and transposon-inserted target DNA. As used herein, the term “affinity tag” and grammatical equivalents can refer to a component of a multi-component complex, wherein the components of the multi-component complex specifically interact with or bind to each other. For example, an affinity tag can include biotin or poly-His that can bind streptavidin or nickel, respectively. Other examples of multiple-component affinity tag complexes are listed, for example, in U.S. Patent Application Pub. No. 2012/0208705, U.S. Patent Application Pub. No. 2012/0208724 and Int. Patent Application Pub. No. WO 2012/061832, each of which is incorporated by reference in its entirety.

Reporter Moieties

In some embodiments of the compositions and methods described herein, a transposon sequence or transposase can include a reporter moiety. In some embodiments comprising looped transposomes, a linker can comprise a reporter moiety. As used herein, the term “reporter moiety” and grammatical equivalents can refer to any identifiable tag, label, or group. The skilled artisan will appreciate that many different species of reporter moieties can be used with the methods and compositions described herein, either individually or in combination with one or more different reporter moieties. In certain embodiments, a reporter moiety can emit a signal. Examples of a signal include, but are not limited to, a fluorescent, a chemiluminescent, a bioluminescent, a phosphorescent, a radioactive, a colorimetric, an ion activity, an electronic, or an electrochemiluminescent signal. Example reporter moieties are listed, for example, in U.S. Patent Application Pub. No. 2012/0208705, U.S. Patent Application Pub. No. 2012/0208724 and Int. Patent Application Pub. No. WO 2012/061832, each of which is incorporated by reference in its entirety.

Certain Methods of Making Transposon Sequences

The transposon sequences provided herein can be prepared by a variety of methods. Exemplary methods include direct synthesis, hairpin extension methods, and PCR. In some embodiments, transposon sequences may be prepared by direct synthesis. For example, a transposon sequence comprising a nucleic acid may be prepared by methods comprising chemical synthesis. Such methods are well known in the art, e.g., solid phase synthesis using phosphoramidite precursors such as those derived from protected 2′-deoxynucleosides, ribonucleosides, or nucleoside analogues. Example methods of preparing transposon sequences can be found in, for example, U.S. Patent Application Pub. No. 2012/0208705, U.S. Patent Application Pub. No. 2012/0208724 and Int. Patent Application Pub. No. WO 2012/061832, each of which is incorporated by reference in its entirety.

In some embodiments comprising looped transposomes, transposon sequences comprising a single stranded linker can be prepared. In some embodiments, the linker couples the transposon sequences of a transposome such that a transposon sequence comprising a first transposase recognition sequence is coupled to a second transposon sequence comprising a second transposase recognition sequence in a 5′ to 3′ orientation. In some embodiments, the linker couples a transposon sequence comprising a first transposase recognition sequence to a second transposon sequence comprising a second transposase recognition sequence in a 5′ to 5′ orientation or in a 3′ to 3′ orientation. Coupling transposon sequences of a transposome in either a 5′ to 5′ orientation or in a 3′ to 3′ orientation can be advantageous to prevent transposase recognition elements, in particular mosaic elements (ME or M), from interacting with one another. For example, coupled transposon sequences can be prepared by preparing transposon sequences comprising either an aldehyde group or oxyamine group. The aldehyde and oxyamine groups can interact to form a covalent bond, thus coupling the transposon sequences.

In some embodiments, transposomes comprising complementary sequences can be prepared. FIG. 2 illustrates an embodiment in which a transposase is loaded with transposon sequences comprising complementary tails. The tails hybridize to form a linked transposon sequence. Hybridization may occur in dilute conditions to decrease the likelihood of hybridization between transposomes.

Target Nucleic Acids

A target nucleic acid can include any nucleic acid of interest. Target nucleic acids can include DNA, RNA, peptide nucleic acid, morpholino nucleic acid, locked nucleic acid, glycol nucleic acid, threose nucleic acid, mixed samples of nucleic acids, polyploid DNA (e.g., plant DNA), mixtures thereof, and hybrids thereof. In a preferred embodiment, genomic DNA or amplified copies thereof are used as the target nucleic acid. In another preferred embodiment, cDNA, mitochondrial DNA or chloroplast DNA is used.

A target nucleic acid can comprise any nucleotide sequence. In some embodiments, the target nucleic acid comprises homopolymer sequences. A target nucleic acid can also include repeat sequences. Repeat sequences can be any of a variety of lengths including, for example, 2, 5, 10, 20, 30, 40, 50, 100, 250, 500 or 1000 nucleotides or more. Repeat sequences can be repeated, either contiguously or non-contiguously, any of a variety of times including, for example, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15 or 20 times or more.

Some embodiments described herein can utilize a single target nucleic acid. Other embodiments can utilize a plurality of target nucleic acids. In such embodiments, a plurality of target nucleic acids can include a plurality of the same target nucleic acids, a plurality of different target nucleic acids where some target nucleic acids are the same, or a plurality of target nucleic acids where all target nucleic acids are different. Embodiments that utilize a plurality of target nucleic acids can be carried out in multiplex formats so that reagents are delivered simultaneously to the target nucleic acids, for example, in one or more chambers or on an array surface. In some embodiments, the plurality of target nucleic acids can include substantially all of a particular organism's genome. The plurality of target nucleic acids can include at least a portion of a particular organism's genome including, for example, at least about 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome. In particular embodiments the portion can have an upper limit that is at most about 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome.

Target nucleic acids can be obtained from any source. For example, target nucleic acids may be prepared from nucleic acid molecules obtained from a single organism or from populations of nucleic acid molecules obtained from natural sources that include one or more organisms. Sources of nucleic acid molecules include, but are not limited to, organelles, cells, tissues, organs, or organisms. Cells that may be used as sources of target nucleic acid molecules may be prokaryotic (bacterial cells, for example, Escherichia, Bacillus, Serratia, Salmonella, Staphylococcus, Streptococcus, Clostridium, Chlamydia, Neisseria, Treponema, Mycoplasma, Borrelia, Legionella, Pseudomonas, Mycobacterium, Helicobacter, Erwinia, Agrobacterium, Rhizobium, and Streptomyces genera); archaeal, such as crenarchaeota, nanoarchaeota or euryarchaeota; or eukaryotic, such as fungi (for example, yeasts), plants, protozoans and other parasites, and animals (including insects (for example, Drosophila spp.), nematodes (e.g., Caenorhabditis elegans), and mammals (for example, rat, mouse, monkey, non-human primate and human)). Target nucleic acids and template nucleic acids can be enriched for certain sequences of interest using various methods well known in the art. Examples of such methods are provided in Int. Pub. No. WO/2012/108864, which is incorporated herein by reference in its entirety. In some embodiments, nucleic acids may be further enriched during methods of preparing template libraries. For example, nucleic acids may be enriched for certain sequences before insertion of transposomes, after insertion of transposomes, and/or after amplification of nucleic acids.

In addition, in some embodiments, target nucleic acids and/or template nucleic acids can be highly purified; for example, nucleic acids can be at least about 70%, 80%, 90%, 95%, 96%, 97%, 98%, 99%, or 100% free from contaminants before use with the methods provided herein. In some embodiments, it is beneficial to use methods known in the art that maintain the quality and size of the target nucleic acid; for example, isolation and/or direct transposition of target DNA may be performed using agarose plugs. Transposition can also be performed directly in cells, with populations of cells, lysates, and non-purified DNA.

Certain Methods of Preparing Template Nucleic Acids

Some embodiments include methods of preparing template nucleic acids. As used herein, “template nucleic acid” can refer to a substrate for obtaining sequence information. In some embodiments, a template nucleic acid can include a target nucleic acid, a fragment thereof, or any copy thereof comprising at least one transposon sequence, a fragment thereof, or any copy thereof. In some embodiments, a template nucleic acid can include a target nucleic acid comprising a sequencing adaptor, such as a sequencing primer site.

Some methods of preparing template nucleic acids include inserting a transposon sequence into a target nucleic acid, thereby preparing a template nucleic acid. Some methods of insertion include contacting a transposon sequence provided herein with a target nucleic acid in the presence of an enzyme, such as a transposase or integrase, under conditions sufficient for the integration of the transposon sequence or sequences into the target nucleic acid.

In some embodiments, insertion of transposon sequences into a target nucleic acid can be non-random. In some embodiments, transposon sequences can be contacted with target nucleic acids comprising proteins that inhibit integration at certain sites. For example, transposon sequences can be inhibited from integrating into genomic DNA comprising proteins, genomic DNA comprising chromatin, genomic DNA comprising nucleosomes, or genomic DNA comprising histones. In some embodiments, transposon sequences can be associated with affinity tags in order to integrate the transposon sequence at a particular sequence in a target nucleic acid. For example, a transposon sequence may be associated with a protein that targets specific nucleic acid sequences, e.g., histones, chromatin-binding proteins, transcription factors, initiation factors, etc., and antibodies or antibody fragments that bind to particular sequence-specific nucleic-acid-binding proteins. In an exemplary embodiment, a transposon sequence is associated with an affinity tag, such as biotin; the affinity tag can be associated with a nucleic-acid-binding protein.

It will be understood that during integration of some transposon sequences into a target nucleic acid, several consecutive nucleotides of the target nucleic acid at the integration site are duplicated in the integrated product. Thus the integrated product can include a duplicated sequence at each end of the integrated sequence in the target nucleic acid. As used herein, the term “host tag” or “g-tag” can refer to a target nucleic acid sequence that is duplicated at each end of an integrated transposon sequence. Single-stranded portions of nucleic acids that may be generated by the insertion of transposon sequences can be repaired by a variety of methods well known in the art, for example by using ligases, oligonucleotides and/or polymerases.
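The duplication described above can be pictured with a minimal string-level sketch in Python. It models only the final, repaired product on one strand; the function name, the lowercase placeholder for the transposon sequence, and the duplication length passed as dup_len are illustrative assumptions rather than features of any particular transposase.

```python
def insert_with_duplication(target, transposon, pos, dup_len):
    """Model an integrated product in which dup_len target bases starting at
    pos are duplicated so that one copy flanks each end of the transposon.

    Returns (integrated_product, host_tag).
    """
    if pos < 0 or dup_len < 0 or pos + dup_len > len(target):
        raise ValueError("insertion site lies outside the target sequence")
    host_tag = target[pos:pos + dup_len]
    # Left flank ends with the host tag; right flank begins with it again.
    product = target[:pos + dup_len] + transposon + target[pos:]
    return product, host_tag


# Example with a hypothetical 3-base duplication; the transposon is shown
# in lowercase so the flanking host tags are easy to see.
product, tag = insert_with_duplication("AAACCCGGGTTT", "tntn", pos=3, dup_len=3)
```

In this example the host tag "CCC" appears immediately before and immediately after the inserted sequence, which is the signature that allows formerly adjacent fragments to be matched.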

In some embodiments, a plurality of the transposon sequences provided herein is inserted into a target nucleic acid. Some embodiments include selecting conditions sufficient to achieve integration of a plurality of transposon sequences into a target nucleic acid such that the average distance between each integrated transposon sequence comprises a certain number of consecutive nucleotides in the target nucleic acid.

Some embodiments include selecting conditions sufficient to achieve insertion of a transposon sequence or sequences into a target nucleic acid, but not into another transposon sequence or sequences. A variety of methods can be used to reduce the likelihood that a transposon sequence inserts into another transposon sequence. Examples of such methods useful with the embodiments provided herein can be found in, for example, U.S. Patent Application Pub. No. 2012/0208705, U.S. Patent Application Pub. No. 2012/0208724 and Int. Patent Application Pub. No. WO 2012/061832, each of which is incorporated by reference in its entirety.

In some embodiments, conditions may be selected so that the average distance in a target nucleic acid between integrated transposon sequences is at least about 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, or more consecutive nucleotides. In some embodiments, the average distance in a target nucleic acid between integrated transposon sequences is at least about 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, or more consecutive nucleotides. In some embodiments, the average distance in a target nucleic acid between integrated transposon sequences is at least about 1 kb, 2 kb, 3 kb, 4 kb, 5 kb, 6 kb, 7 kb, 8 kb, 90 kb, 100 kb, or more consecutive nucleotides. In some embodiments, the average distance in a target nucleic acid between integrated transposon sequences is at least about 100 kb, 200 kb, 300 kb, 400 kb, 500 kb, 600 kb, 700 kb, 800 kb, 900 kb, 1000 kb, or more consecutive nucleotides. As will be understood, some conditions that may be selected include contacting a target nucleic acid with a certain number of transposon sequences.
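For illustration, the average distance between integrated transposon sequences could be estimated from a list of insertion coordinates as sketched below. The function name and the use of 0-based integer coordinates along a single target nucleic acid are assumptions made for this example.

```python
def mean_insert_spacing(positions):
    """Average distance, in nucleotides, between neighbouring insertion sites.

    positions: insertion coordinates along one target nucleic acid, in any order.
    """
    if len(positions) < 2:
        raise ValueError("need at least two insertion positions")
    ordered = sorted(positions)
    # Distance between each site and the next site downstream.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)
```

A value computed this way could be compared against a chosen threshold, for example at least about 100 consecutive nucleotides, when evaluating a set of insertion conditions.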

Some embodiments of the methods described herein include selecting conditions sufficient to achieve at least a portion of transposon sequences integrated into a target nucleic acid that are different. In preferred embodiments of the methods and compositions described herein, each transposon sequence integrated into a target nucleic acid is different. Some conditions that may be selected to achieve a certain portion of transposon sequences integrated into target sequences that are different include selecting the degree of diversity of the population of transposon sequences. As will be understood, the diversity of transposon sequences arises in part due to the diversity of the barcodes of such transposon sequences. Accordingly, some embodiments include providing a population of transposon sequences in which at least a portion of the barcodes are different. In some embodiments, at least about 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 98%, 99%, or 100% of barcodes in a population of transposon sequences are different. In some embodiments, at least a portion of the transposon sequences integrated into a target nucleic acid are the same.
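One possible way to quantify barcode diversity in a population is sketched below in Python. It assumes, as one interpretation only, that the percentage of barcodes that are different is the number of distinct barcode sequences divided by the total number of barcodes; the function name is hypothetical.

```python
def distinct_fraction(barcodes):
    """Fraction of a barcode population made up of distinct sequences.

    Computed as (number of distinct barcodes) / (total number of barcodes).
    """
    if not barcodes:
        raise ValueError("barcode population is empty")
    return len(set(barcodes)) / len(barcodes)
```

Under this interpretation, a population in which every barcode is unique gives a value of 1.0, corresponding to 100% of barcodes being different.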

Some embodiments of preparing a template nucleic acid can include copying the sequences comprising the target nucleic acid. For example, some embodiments include hybridizing a primer to a primer site of a transposon sequence integrated into the target nucleic acid. In some such embodiments, the primer can be hybridized to the primer site and extended. The copied sequences can include at least one barcode sequence and at least a portion of the target nucleic acid. In some embodiments, the copied sequences can include a first barcode sequence, a second barcode sequence, and at least a portion of a target nucleic acid disposed therebetween. In some embodiments, at least one copied nucleic acid can include at least a first barcode sequence of a first copied nucleic acid that can be identified or designated to be paired with a second barcode sequence of a second copied nucleic acid. In some embodiments, the primer can include a sequencing primer. In some embodiments sequencing data is obtained using the sequencing primer. In more embodiments, adaptors comprising primer sites can be ligated to each end of a nucleic acid, and the nucleic acid amplified from such primer sites.

Some embodiments of preparing a template nucleic acid can include amplifying sequences comprising at least a portion of one or more transposon sequences and at least a portion of a target nucleic acid. In some embodiments, at least a portion of a target nucleic acid can be amplified using primers that hybridize to primer sites of transposon sequences integrated into a target nucleic acid. In some such embodiments, an amplified nucleic acid can include a first barcode sequence and a second barcode sequence having at least a portion of the target nucleic acid disposed therebetween. In some embodiments, at least one amplified nucleic acid can include at least a first barcode sequence of a first amplified nucleic acid that can be identified to be paired with a second barcode sequence of a second amplified sequence.

Some methods of preparing template nucleic acids include inserting transposon sequences comprising single-stranded linkers. FIG. 3 illustrates an example in which transposon sequences (ME-P1-linker-P2-ME; mosaic end-primer site 1-linker-primer site 2-mosaic end) are inserted into a target nucleic acid. The target nucleic acid having the inserted transposon/linker sequences can be extended and amplified.

In one embodiment of the compositions and methods described herein, transposomes are used that have symmetrical transposable end sequences to produce an end-tagged target nucleic acid fragment (tagmented fragment or tagment). Each tagmented fragment therefore contains identical ends, lacking directionality. A single primer PCR, using the transposon end sequences, can then be employed to amplify the template copy number from 2n to 2n × 2^x, where x corresponds to the number of PCR cycles. In a subsequent step, PCR with primers can add additional sequences, such as sequencing adapter sequences.
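The copy-number relationship above can be illustrated with a short calculation. This is a minimal sketch, not part of the described method: the function name `copies_after_pcr` and the `efficiency` parameter are assumptions introduced here. With perfect doubling (`efficiency=1.0`) the result reduces to 2n × 2^x as stated in the text.

```python
def copies_after_pcr(n_fragments: int, cycles: int, efficiency: float = 1.0) -> float:
    """Expected strand copy number after single-primer PCR.

    n_fragments: number of double-stranded tagmented fragments (n), i.e. 2n strands.
    cycles:      number of PCR cycles (x).
    efficiency:  hypothetical per-cycle efficiency in [0, 1]; 1.0 means perfect doubling.
    """
    if n_fragments < 0 or cycles < 0:
        raise ValueError("n_fragments and cycles must be non-negative")
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    # Each cycle multiplies the strand count by (1 + efficiency).
    return 2 * n_fragments * (1.0 + efficiency) ** cycles
```

For example, `copies_after_pcr(1000, 10)` corresponds to 2,000 starting strands doubled ten times.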

In some embodiments, it can be advantageous for each template nucleic acid to incorporate at least one universal primer site. For example, a template nucleic acid can include first end sequences that comprise a first universal primer site, and second end sequences that comprise a second universal primer site. Universal primer sites can have various applications, such as use in amplifying, sequencing, and/or identifying one or more template nucleic acids. The first and second universal primer sites can be the same, substantially similar, similar, or different. Universal primer sites can be introduced into nucleic acids by various methods well known in the art, for example, ligation of primer sites to nucleic acids, amplification of nucleic acids using tailed primers, and insertion of a transposon sequence comprising a universal primer site.

Targeted Insertion

In some embodiments of the methods and compositions provided herein, transposon sequences may be inserted at particular targeted sequences of a target nucleic acid. Transposition into dsDNA can be more efficient than into ssDNA targets. In some embodiments, dsDNA is denatured into ssDNA and annealed with oligonucleotide probes (20-200 bases). These probes create sites of dsDNA that can be efficiently used as integration sites with transposomes provided herein. In some embodiments, dsDNA can be targeted using D-loop formation with recA-coated oligo probes, and subsequent triplex formation. In some such embodiments, the replication fork structure is the preferred substrate for transposomes comprising Tn4430 transposase. In more embodiments, regions of interest in dsDNA can be targeted using sequence-specific DNA binding proteins such as zinc-finger complexes, and other affinity ligands to specific DNA regions.

In some embodiments, transposomes comprising a transposase having a preferred substrate of mismatched positions in a target nucleic acid may be used to target insertion into the target nucleic acid. For example, some MuA transposases, such as HYPERMU (Epicenter), have a preference for mismatched targets. In some such embodiments, oligonucleotide probes comprising a mismatch are annealed to a single-stranded target nucleic acid. Transposomes comprising MuA transposases, such as HYPERMU, can be used to target the mismatched sequences of the target nucleic acid.

Fragmenting Template Nucleic Acids

Some embodiments of preparing a template nucleic acid can include fragmenting a target nucleic acid. In some embodiments, insertion of transposomes comprising non-contiguous transposon sequences can result in fragmentation of a target nucleic acid. In some embodiments comprising looped transposomes, a target nucleic acid comprising transposon sequences can be fragmented at the fragmentation sites of the transposon sequences. Further examples of methods for fragmenting target nucleic acids that are useful with the embodiments provided herein can be found in, for example, U.S. Patent Application Pub. No. 2012/0208705, U.S. Patent Application Pub. No. 2012/0208724 and Int. Patent Application Pub. No. WO 2012/061832, each of which is incorporated by reference in its entirety.

Tagging Single Molecules

The present invention provides methods for tagging molecules so that individual molecules can be tracked and identified. The bulk data can then be deconvoluted and converted back to the individual molecule. The ability to distinguish individual molecules and relate the information back to the molecule of origin is especially important when processes from original molecule to final product change the (stoichiometric) representation of the original population. For example, amplification leads to duplication (e.g., PCR duplicates or biased amplification) that can skew the original representation. This can alter the methylation state call, copy number, or allelic ratio due to non-uniform amplification and/or amplification bias. By identifying individual molecules, code-tagging distinguishes between identical molecules after processing. As such, duplications and amplification bias can be filtered out, allowing accurate determination of the original representation of a molecule or population of molecules.

An advantage of uniquely tagging single molecules is that identical molecules in the original pool become uniquely identified by virtue of their tagging. In further downstream analyses, these uniquely tagged molecules can now be distinguished. This technique can be exploited in assay schemes in which amplification is employed. For example, amplification is known to distort the original representation of a mixed population of molecules. If unique tagging were not employed, the original representation (such as copy number or allelic ratio) would need to account for the biases (known or unknown) for each molecule in the representation. With unique tagging, the representation can accurately be determined by removing duplicates and counting the original representation of molecules, each having a unique tag. Thus, cDNAs can be amplified and sequenced, without fear of bias because the data can be filtered so that only authentic sequences or sequences of interest are selected for further analysis. Accurate reads can be constructed by taking the consensus across many reads with the same barcode.
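The duplicate removal and per-barcode consensus described above can be sketched as follows. This is an illustrative sketch only: the function name `consensus_by_barcode`, the `(barcode, sequence)` input format, and the simple per-position majority vote are assumptions made here, not a procedure specified in this disclosure.

```python
from collections import Counter, defaultdict

def consensus_by_barcode(reads):
    """Collapse reads that share a barcode into one consensus sequence each.

    reads: iterable of (barcode, sequence) pairs (hypothetical input format).
    Returns a dict mapping each barcode to a per-position majority-vote consensus.
    The number of keys estimates the number of original tagged molecules.
    """
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)

    consensus = {}
    for barcode, seqs in groups.items():
        # Only compare positions covered by every read in the group.
        length = min(len(s) for s in seqs)
        consensus[barcode] = "".join(
            Counter(s[i] for s in seqs).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus
```

Counting the keys of the returned dictionary gives the duplicate-free molecule count, while each value is a consensus read in which isolated amplification or base-call errors are outvoted.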

In some embodiments of the compositions and methods described herein, it is preferred to tag the original population in the early stages of the assay, although tagging can occur at later stages if the earlier steps do not introduce bias or are not important. In any of these applications, the complexity of the barcode sequences should be larger than the number of individual molecules to be tagged. This ensures that different target molecules receive different and unique tags. As such, a pool of random oligonucleotides of a certain length (e.g., 5, 10, 20, 30, 40, 50, 100 or 200 nucleotides in length) is desirable. A random pool of tags represents a large complexity of tags with code space 4^n, where n is the number of nucleotides. Additional codes (whether designed or random) can be incorporated at different stages to serve as a further check, such as a parity check for error correction.
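The relationship between tag length, code space, and the number of molecules to be tagged can be explored numerically. The sketch below assumes, for illustration only, that each molecule draws its tag independently and uniformly from all 4^n sequences; the function names are hypothetical.

```python
import math

def code_space(tag_length: int) -> int:
    """Number of distinct tags of a given length over the alphabet A, C, G, T."""
    return 4 ** tag_length

def prob_all_unique(num_molecules: int, tag_length: int) -> float:
    """Probability that every molecule receives a distinct tag.

    Assumes tags are drawn independently and uniformly at random
    (an idealization; real oligonucleotide pools may be biased).
    """
    space = code_space(tag_length)
    if num_molecules > space:
        return 0.0  # more molecules than tags: a collision is certain
    # Product of (1 - i/space) for i = 0..num_molecules-1, summed in log space.
    log_p = sum(math.log1p(-i / space) for i in range(num_molecules))
    return math.exp(log_p)
```

Under this assumption, a code space merely equal to the number of molecules still yields many collisions, which is one way to see why the complexity of the barcode pool should substantially exceed the number of molecules to be tagged.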

In one embodiment of the compositions and methods described herein, individual molecules (such as target DNA) are attached to unique labels, such as unique oligo sequences and/or barcodes. Attachment of the labels can occur through ligation, coupling chemistry, adsorption, insertion of transposon sequences, etc. Other means include amplification (such as by PCR, RCA or LCR), copying (such as addition by a polymerase), and non-covalent interactions.

Specific methods comprise adding barcodes (e.g., designed or random sequences) to PCR primers so that each template will receive an individual code within the code space, thereby yielding unique amplicons that can be discriminated from other amplicons. This concept can be applied to any method that uses polymerase amplification, such as GoldenGate assays as disclosed in U.S. Pat. Nos. 7,582,420, 7,955,794, and 8,003,354, each of which is incorporated by reference in its entirety. Code-tagged target sequences can be circularized and amplified by methods such as rolling-circle amplification to yield code-tagged amplicons. Similarly, the code can also be added to RNA

Methods of Analyzing Template Nucleic Acids

Some embodiments of the technology described herein include methods of analyzing template nucleic acids. In such embodiments, sequencing information can be obtained from template nucleic acids and this information can be used to generate a sequence representation of one or more target nucleic acids.

In some embodiments of the sequencing methods described herein, a linked read strategy may be used. A linked read strategy can include identifying sequencing data that links at least two sequencing reads. For example, a first sequencing read may contain a first marker, and a second sequencing read may contain a second marker. The first and second markers can identify the sequencing data from each sequencing read to be adjacent in a sequence representation of the target nucleic acid. In some embodiments of the compositions and methods described herein, markers can comprise a first barcode sequence and a second barcode sequence in which the first barcode sequence can be paired with the second barcode sequence. In other embodiments, markers can comprise a first host tag and a second host tag. In more embodiments, markers can comprise a first barcode sequence with a first host tag, and a second barcode sequence with a second host tag.

An exemplary embodiment of a method for sequencing a template nucleic acid can comprise the following steps: (a) sequence the first barcode sequence using a sequencing primer hybridizing to the first primer site; and (b) sequence the second barcode sequence using a sequencing primer hybridizing to the second primer site. The result is two sequence reads that help link the template nucleic acid to its genomic neighbors. Given long enough reads, and short enough library fragments, these two reads can be merged informatically to make one long read that covers the entire fragment. Using the barcode sequence reads and the 9 nucleotide duplicated sequence present from the insertion, reads can now be linked to their genomic neighbors to form much longer “linked reads” in silico.

As will be understood, a library comprising template nucleic acids can include duplicate nucleic acid fragments. Sequencing duplicate nucleic acid fragments is advantageous in methods that include creating a consensus sequence for duplicate fragments. Such methods can increase the accuracy for providing a consensus sequence for a template nucleic acid and/or library of template nucleic acids.

In some embodiments of the sequencing technology described herein, sequence analysis is performed in real time. For example, real time sequencing can be performed by simultaneously acquiring and analyzing sequencing data. In some embodiments, a sequencing process to obtain sequencing data can be terminated at various points, including after at least a portion of a target nucleic acid sequence data is obtained or before the entire nucleic acid read is sequenced. Exemplary methods, systems, and further embodiments are provided in International Patent Publication No. WO 2010/062913, the disclosure of which is incorporated herein by reference in its entirety.

In an exemplary embodiment of a method for assembling short sequencing reads using a linked read strategy, transposon sequences comprising barcodes are inserted into genomic DNA, a library is prepared and sequencing data is obtained for the library of template nucleic acids. Blocks of templates can be assembled by identifying paired barcodes and then larger contigs are assembled. In one embodiment, the assembled reads can be further assembled into larger contigs through code pairing using overlapping reads.
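The idea of assembling blocks of templates by identifying paired barcodes can be sketched in code. The example below is a simplified illustration under stated assumptions: each fragment is represented as a hypothetical `(left_code, sequence, right_code)` tuple, paired barcodes are taken to be identical strings (the pairing could equally be complementary or otherwise informatically linked), every code is unique, and the target is linear. The function name `order_fragments` is introduced here for illustration.

```python
def order_fragments(fragments):
    """Order fragments into chains using paired junction barcodes.

    fragments: list of (left_code, sequence, right_code) tuples. A fragment's
    right_code is assumed to equal the left_code of its genomic neighbor.
    Returns a list of chains, each a list of sequences in inferred order.
    """
    by_left = {left: (left, seq, right) for left, seq, right in fragments}
    right_codes = {right for _, _, right in fragments}

    # A chain starts at any fragment whose left code pairs with no right code.
    starts = [frag for frag in fragments if frag[0] not in right_codes]

    chains = []
    for frag in starts:
        chain, seen = [], set()
        while frag is not None and frag[0] not in seen:
            seen.add(frag[0])          # guard against accidental cycles
            chain.append(frag[1])
            frag = by_left.get(frag[2])  # follow the paired barcode
        chains.append(chain)
    return chains
```

Each returned chain corresponds to a block of linked reads; overlapping reads between blocks could then be used to join them into larger contigs.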

Some embodiments of the sequencing technology described herein include error detection and correction features. Examples of errors can include errors in base calls during a sequencing process, and errors in assembling fragments into larger contigs. As would be understood, error detection can include detecting the presence or likelihood of errors in a data set, and as such, detecting the location of an error or number of errors may not be required. For error correction, information regarding the location of an error and/or the number of errors in a data set is useful. Methods for error correction are well known in the art. Examples include the use of Hamming distances, and the use of a checksum algorithm (See, e.g., U.S. Patent Application Publication No. 2010/0323348; U.S. Pat. Nos. 7,574,305; and 6,654,696, the disclosures of which are incorporated herein by reference in their entireties).
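As one illustration of Hamming-distance-based error handling, the sketch below corrects an observed barcode against a set of known codes. It assumes, for illustration only, a designed codebook whose members differ from one another at enough positions that a single base-call error still maps to a unique code; the function names and the `max_distance` parameter are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def correct_barcode(observed: str, known_codes, max_distance: int = 1):
    """Return the unique known code within max_distance of observed, else None.

    Returning None signals a detected but uncorrectable error: either no
    known code is close enough, or more than one is equally plausible.
    """
    hits = [
        code for code in known_codes
        if len(code) == len(observed) and hamming(observed, code) <= max_distance
    ]
    return hits[0] if len(hits) == 1 else None
```

The distinction between detection and correction drawn in the text appears here directly: an observed barcode absent from the codebook reveals an error, but it can be corrected only when exactly one known code lies within the allowed distance.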

Nested Libraries

An alternative method involves the junction tagging methods above and preparation of nested sequencing libraries. The nested sub-libraries are created from code-tagged DNA fragments. This can allow less frequent transposon tagging across the genome. It can also create a larger diversity of (nested) sequencing reads. These factors can lead to improved coverage and accuracy.

Sub-sampling and whole genome amplification can create many copies of a certain population of starting molecules. DNA fragments are then generated by transposon-specific fragmentation, where each fragment receives a code that allows one to link the fragment back to the original neighbor having a matching code (whether identical, complementary or otherwise informatically linked). The tagged fragments are fragmented at least a second time by random methods or sequence-specific methods, such as enzymatic digestion, random shearing, transposon-based shearing or other methods, thereby creating sub-libraries of the code-tagged DNA fragments. In a useful variation of the previously-described method, code-tagged fragments can be preferentially isolated by using transposons that contain a biotin or other affinity functionality for downstream enrichment purposes. Subsequent library preparation converts the nested DNA fragments into sequencing templates. Paired-end sequencing results in determination of the sequence of the code-tag of the DNA fragments and of the target DNA. Since nested libraries for the same code-tag are created, long DNA fragments can be sequenced with short reads.

Sequencing Methods

The methods and compositions described herein can be used in conjunction with a variety of sequencing techniques. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid can be an automated process.

Some embodiments of the sequencing methods described herein include sequencing by synthesis (SBS) technologies, for example, pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi et al., Analytical Biochemistry 242(1): 84-9 (1996); Ronaghi, M. Genome Res. 11(1):3-11 (2001); Ronaghi et al., Science 281(5375):363 (1998); U.S. Pat. Nos. 6,210,891; 6,258,568 and 6,274,320, each of which is incorporated by reference in its entirety).

In another example type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in U.S. Pat. Nos. 7,427,673, 7,414,116 and 7,057,026, each of which is incorporated by reference in its entirety. This approach, which is being commercialized by Illumina Inc., is also described in International Patent Application Publication Nos. WO 91/06678 and WO 07/123744, each of which is incorporated by reference in its entirety. The availability of fluorescently-labeled terminators, in which both the termination can be reversed and the fluorescent label cleaved, facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.

Additional exemplary SBS systems and methods which can be utilized with the methods and compositions described herein are described in U.S. Patent Application Publication No. 2007/0166705, U.S. Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006/0240439, U.S. Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, U.S. Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199 and PCT Publication No. WO 07/010251, each of which is incorporated by reference in its entirety.

Some embodiments of the sequencing technology described herein can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate nucleotides and identify the incorporation of such nucleotides. Exemplary SBS systems and methods which can be utilized with the compositions and methods described herein are described in U.S. Pat. Nos. 6,969,488, 6,172,218, and 6,306,597, each of which is incorporated by reference in its entirety.

Some embodiments of the sequencing technology described herein can include techniques such as next-next technologies. One example can include nanopore sequencing techniques (Deamer, D. W. & Akeson, M. “Nanopores and nucleic acids: prospects for ultrarapid sequencing.” Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, “Characterization of nucleic acids by nanopore analysis”. Acc. Chem. Res. 35:817-825 (2002); Li et al., “DNA molecules and configurations in a solid-state nanopore microscope” Nat. Mater. 2:611-615 (2003), each of which is incorporated by reference in its entirety). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore (U.S. Pat. No. 7,001,792; Soni & Meller, “Progress toward ultrafast DNA sequencing using solid-state nanopores,” Clin. Chem. 53, 1996-2001 (2007); Healy, K. “Nanopore-based single-molecule DNA analysis,” Nanomed. 2:459-481 (2007); Cockroft et al., “A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution.” J. Am. Chem. Soc. 130:818-820 (2008), each of which is incorporated by reference in its entirety). In some such embodiments, nanopore sequencing techniques can be useful to confirm sequence information generated by the methods described herein.

Some embodiments of the sequencing technology described herein can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. Nos. 7,329,492 and 7,211,414, or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008/0108082, each of which is incorporated by reference in its entirety. The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. “Zero-mode waveguides for single-molecule analysis at high concentrations.” Science 299, 682-686 (2003); Lundquist, P. M. et al. “Parallel confocal detection of single molecules in real time.” Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. “Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures,” Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), each of which is incorporated by reference in its entirety). In one example, single molecule, real-time (SMRT) DNA sequencing technology provided by Pacific Biosciences Inc. can be utilized with the methods described herein. In some embodiments, a SMRT chip or the like may be utilized (e.g., U.S. Pat. Nos. 7,181,122, 7,302,146 and 7,313,308, each of which is incorporated by reference in its entirety). A SMRT chip comprises a plurality of zero-mode waveguides (ZMW). Each ZMW comprises a cylindrical hole tens of nanometers in diameter perforating a thin metal film supported by a transparent substrate. When the ZMW is illuminated through the transparent substrate, attenuated light may penetrate the lower 20-30 nm of each ZMW creating a detection volume of about 1×10⁻²¹ L. Smaller detection volumes increase the sensitivity of detecting fluorescent signals by reducing the amount of background that can be observed.

SMRT chips and similar technology can be used in association with nucleotide monomers fluorescently labeled on the terminal phosphate of the nucleotide (Korlach J, et al., “Long, processive enzymatic DNA synthesis using 100% dye-labeled terminal phosphate-linked nucleotides.” Nucleosides, Nucleotides and Nucleic Acids, 27:1072-1083, 2008, which is incorporated by reference in its entirety). The label is cleaved from the nucleotide monomer on incorporation of the nucleotide into the polynucleotide. Accordingly, the label is not incorporated into the polynucleotide, increasing the signal:background ratio. Moreover, the need for conditions to cleave a label from labeled nucleotide monomers is reduced.

An additional example of a sequencing platform that may be used in association with some of the embodiments described herein is provided by Helicos Biosciences Corp. In some embodiments, TRUE SINGLE MOLECULE SEQUENCING can be utilized (Harris T. D. et al., “Single Molecule DNA Sequencing of a viral Genome” Science 320:106-109 (2008), which is incorporated by reference in its entirety). In one embodiment, a library of target nucleic acids can be prepared by the addition of a 3′ poly(A) tail to each target nucleic acid. The poly(A) tail hybridizes to poly(T) oligonucleotides anchored on a glass cover slip. The poly(T) oligonucleotide can be used as a primer for the extension of a polynucleotide complementary to the target nucleic acid. In one embodiment, fluorescently-labeled nucleotide monomers, namely, A, C, G, or T, are delivered one at a time to the target nucleic acid in the presence of DNA polymerase. Incorporation of a labeled nucleotide into the polynucleotide complementary to the target nucleic acid is detected, and the position of the fluorescent signal on the glass cover slip indicates the molecule that has been extended. The fluorescent label is removed before the next nucleotide is added to continue the sequencing cycle. Tracking nucleotide incorporation in each polynucleotide strand can provide sequence information for each individual target nucleic acid.

An additional example of a sequencing platform that can be used in association with the methods described herein is provided by Complete Genomics Inc. Libraries of target nucleic acids can be prepared where target nucleic acid sequences are interspersed approximately every 20 bp with adaptor sequences. The target nucleic acids can be amplified using rolling circle replication, and the amplified target nucleic acids can be used to prepare an array of target nucleic acids. Methods of sequencing such arrays include sequencing by ligation, in particular, sequencing by combinatorial probe-anchor ligation (cPAL).

In some embodiments using cPAL, about 10 contiguous bases adjacent to an adaptor may be determined. A pool of probes that includes four distinct labels for each base (A, C, T, G) is used to read the positions adjacent to each adaptor. A separate pool is used to read each position. A pool of probes and an anchor specific to a particular adaptor is delivered to the target nucleic acid in the presence of ligase. The anchor hybridizes to the adaptor, and a probe hybridizes to the target nucleic acid adjacent to the adaptor. The anchor and probe are ligated to one another. The hybridization is detected and the anchor-probe complex is removed. A different anchor and pool of probes is then delivered to the target nucleic acid in the presence of ligase.

The sequencing methods described herein can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids are typically coupled to a surface in a spatially distinguishable manner. For example, the target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle, or association with a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail herein.

The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features/cm², 100 features/cm², 500 features/cm², 1,000 features/cm², 5,000 features/cm², 10,000 features/cm², 50,000 features/cm², 100,000 features/cm², 1,000,000 features/cm², 5,000,000 features/cm², 10⁷ features/cm², 5×10⁷ features/cm², 10⁸ features/cm², 5×10⁸ features/cm², 10⁹ features/cm², 5×10⁹ features/cm², or higher.

Surfaces

In some embodiments, the nucleic acid template provided herein can be attached to a solid support (“substrate”). Substrates can be two- or three-dimensional and can comprise a planar surface (e.g., a glass slide) or can be shaped. A substrate can include glass (e.g., controlled pore glass (CPG)), quartz, plastic (such as polystyrene (low cross-linked and high cross-linked polystyrene), polycarbonate, polypropylene and poly(methylmethacrylate)), acrylic copolymer, polyamide, silicon, metal (e.g., alkanethiolate-derivatized gold), cellulose, nylon, latex, dextran, gel matrix (e.g., silica gel), polyacrolein, or composites.

Suitable three-dimensional substrates include, for example, spheres, microparticles, beads, membranes, slides, plates, micromachined chips, tubes (e.g., capillary tubes), microwells, microfluidic devices, channels, filters, or any other structure suitable for anchoring a nucleic acid. Substrates can include planar arrays or matrices capable of having regions that include populations of template nucleic acids or primers. Examples include nucleoside-derivatized CPG and polystyrene slides; derivatized magnetic slides; polystyrene grafted with polyethylene glycol, and the like. Various methods well known in the art can be used to attach, anchor or immobilize nucleic acids to the surface of the substrate.

Methods for Reducing Error Rates in Sequencing Data

Some embodiments of the methods and compositions provided herein include reducing the error rates in sequencing data. In some such embodiments, the sense and antisense strands of a double-stranded target nucleic acid are each associated with a different barcode. Each strand is amplified, sequence information is obtained from multiple copies of the amplified strands, and a consensus sequence representation of the target nucleic acid is generated from the redundant sequence information. Thus, sequence information can originate and be identified from each strand. Accordingly, sequence errors can be identified and reduced where sequence information originating from one strand is inconsistent with sequence information from the other strand.
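The comparison of sequence information from the two strands can be sketched as follows. This is a minimal illustration, not the disclosed method itself: it assumes that a per-strand consensus has already been built for each barcode, that the antisense consensus is reported 5′ to 3′ and must be reverse-complemented before comparison, and that inconsistent positions are masked with `N`. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence over A, C, G, T, N."""
    return seq.translate(_COMPLEMENT)[::-1]

def duplex_consensus(sense_consensus: str, antisense_consensus: str) -> str:
    """Combine per-strand consensus sequences into one duplex consensus.

    Positions where the two strands agree are kept; positions where they
    disagree are masked as 'N', flagging a likely error on one strand.
    """
    antisense_as_sense = reverse_complement(antisense_consensus)
    if len(antisense_as_sense) != len(sense_consensus):
        raise ValueError("strand consensus sequences must have equal length")
    return "".join(
        a if a == b else "N"
        for a, b in zip(sense_consensus, antisense_as_sense)
    )
```

Masking is only one possible policy; a disagreement could instead be resolved using base quality scores or read depth, which this sketch does not model.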

In some embodiments, the sense and antisense strands of a target nucleic acid are associated with a different barcode. The barcodes may be associated with the target nucleic acid by a variety of methods including ligation of adaptors and insertion of transposon sequences. In some such embodiments, a Y-adaptor may be ligated to at least one end of a target nucleic acid. The Y-adaptor can include a double-stranded sequence, and non-complementary strands, each strand comprising a different barcode. The target nucleic acid with ligated Y-adaptor can be amplified and sequenced such that each barcode can be used to identify the original sense or antisense strands. A similar method is described in Kinde I. et al., (2011) PNAS 108:9530-9535, the disclosure of which is incorporated herein by reference in its entirety. In some embodiments, the sense and antisense strands of a target nucleic acid are associated with a different barcode by inserting transposon sequences provided herein. In some such embodiments, the transposon sequences can comprise non-complementary barcodes.

Some embodiments of such methods include obtaining sequence information from a strand of a target double-stranded nucleic acid comprising (a) obtaining sequence data from a template nucleic acid comprising a first sequencing adapter and a second sequencing adapter having at least a portion of the double-stranded target nucleic acid disposed therebetween, wherein: (i) the first sequencing adapter comprises a double-stranded first barcode, a single-stranded first primer site and a single-stranded second primer site, wherein the first and second primer sites are non-complementary, and (ii) the second sequencing adapter comprises a double-stranded second barcode, a single-stranded third primer site and a single-stranded fourth primer site, wherein the third and fourth primer sites are non-complementary. In some embodiments, the first primer site of the sense strand of the template nucleic acid and the third primer site of the antisense strand of the template nucleic acid comprise the same sequence. In some embodiments, each barcode is different. In some embodiments, the first sequencing adapter comprises a single-stranded hairpin coupling the first primer site and second primer site.

In another embodiment, each end of a target nucleic acid is associated with an adaptor comprising a different barcode such that extension products from the sense and antisense strand of a nucleic acid can be distinguished from each other. In some embodiments, primer site sequences and barcodes are selected such that extension from a primer annealed to the sense strand yields products that can be distinguished from products of extension from a primer annealed to the antisense strand. In an example, the 3′ sense primer site is the same as the 3′ antisense primer site, but different from both the 5′ sense and 5′ antisense primer sites. Extension of primers annealed to the 3′ sense primer site and the 3′ antisense primer site would yield the following products from each strand:

Sense strand: (5′) barcode 2−[target sequence]−barcode 1 (3′)

Antisense strand: (5′) barcode 1−[target sequence]−barcode 2 (3′)

Thus, extension products from the sense and antisense strand of a nucleic acid can be distinguished from each other. An exemplary method is illustrated in Schmitt M. W., et al., PNAS (2012) 109:14508-13, the disclosure of which is incorporated herein by reference in its entirety. In some such methods, the barcodes and primer sites may be associated with the target nucleic acid by a variety of methods including ligation of adaptors and insertion of transposon sequences. In some embodiments, transposon sequences can be designed to provide adaptors with hairpins. Hairpins provide the ability to maintain the physical contiguity of the sense and antisense strands of a target nucleic acid. A template nucleic acid can be prepared comprising hairpins using transposon sequences comprising linkers described herein. Examples of linkers include single-stranded nucleic acids.

Some embodiments of preparing a library of template nucleic acids for obtaining sequence information from each strand of a double-stranded target nucleic acid include (a) providing a population of transposomes comprising a transposase and a first transposon sequence comprising: (i) a first transposase recognition site, a first primer site, and a first barcode, and (ii) a second transposon sequence comprising a second transposase recognition site, a second primer site, and a second barcode, wherein the first transposon sequence is non-contiguous with the second transposon sequence; and (b) contacting the transposomes with a double-stranded nucleic acid under conditions such that said first and second transposon sequences insert into the double-stranded target nucleic acid, thereby preparing a library of template nucleic acids for obtaining sequence information from each strand of the double-stranded target nucleic acid. In some embodiments, the population of transposomes further comprises transposomes comprising a transposase and a transposon sequence comprising a third transposase recognition site and a fourth transposase recognition site having a barcode sequence disposed therebetween, said barcode sequence comprising a third barcode and a fourth barcode having a sequencing adapter disposed therebetween, said sequencing adapter comprising a third primer site and a fourth primer site having a linker disposed therebetween. In some embodiments, the first primer site of the sense strand of the template nucleic acid and the third primer site of the antisense strand of the template nucleic acid comprise the same sequence. Some embodiments also include a step (c) selecting for template nucleic acids comprising transposon sequences wherein the first transposon sequence is non-contiguous with the second transposon sequence and transposon sequences comprising a linker. In some embodiments, the linker comprises an affinity tag adapted to bind with a capture probe. In some embodiments, the affinity tag is selected from the group consisting of His, biotin, and streptavidin. In some embodiments, each barcode is different. In some embodiments, the linker comprises a single-stranded nucleic acid. In some embodiments, the target nucleic acid comprises genomic DNA.

Methods for Obtaining Haplotype Information

Some embodiments of the methods and compositions provided herein include methods of obtaining haplotype information from a target nucleic acid. Haplotype information can include determining the presence or absence of different sequences at specified loci in a target nucleic acid, such as a genome. For example, sequence information can be obtained for maternal and paternal copies of an allele. In a polyploid organism, sequence information can be obtained for at least one haplotype. Such methods are also useful in reducing the error rate in obtaining sequence information from a target nucleic acid.

Generally, methods to obtain haplotype information include distributing a nucleic acid into one or more compartments such that each compartment comprises an amount of nucleic acid equivalent to about a haploid equivalent of the nucleic acid, or equivalent to less than about a haploid equivalent of the nucleic acid. Sequence information can then be obtained from each compartment, thereby obtaining haplotype information. Distributing the template nucleic acid into a plurality of vessels increases the probability that a single vessel includes a single copy of an allele or SNP, or that consensus sequence information obtained from a single vessel reflects the sequence information of an allele or SNP. As will be understood, in some such embodiments, a template nucleic acid may be diluted prior to compartmentalizing the template nucleic acid into a plurality of vessels. For example, each vessel can contain an amount of target nucleic acids equal to about a haploid equivalent of the target nucleic acid. In some embodiments, a vessel can include less than about one haploid equivalent of a target nucleic acid.
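The effect of dilution on the chance that a vessel holds a single copy of a locus can be illustrated with a simple model. The following Python sketch is not part of the disclosed methods; it assumes, for illustration only, that the number of copies of a given locus landing in a vessel follows a Poisson distribution whose mean is the number of haploid equivalents per vessel. The function name is hypothetical.

```python
import math


def prob_single_copy_given_occupied(mean_copies: float) -> float:
    """P(exactly one copy of a locus | at least one copy) in a vessel.

    Assumption (not from the specification): copies of a locus are
    distributed among vessels as a Poisson variable with mean
    `mean_copies`, i.e. the haploid equivalents loaded per vessel.
    """
    if mean_copies <= 0:
        raise ValueError("mean_copies must be positive")
    p_one = mean_copies * math.exp(-mean_copies)
    p_at_least_one = 1.0 - math.exp(-mean_copies)
    return p_one / p_at_least_one


if __name__ == "__main__":
    for lam in (0.1, 0.5, 1.0, 2.0):
        p = prob_single_copy_given_occupied(lam)
        print(f"{lam:>4} haploid equivalents per vessel: "
              f"P(single copy | occupied) = {p:.3f}")
```

Under this assumed model, lowering the amount loaded per vessel raises the probability that an occupied vessel carries only one copy of the locus, which is the behavior the paragraph above describes.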

Method of Haplotyping with Virtual Compartments

Some methods of obtaining haplotype information provided herein include the use of virtual compartments. Advantageously, some such methods enable compartments to include amounts of nucleic acids equivalent to at least one or more haploid equivalents. In other words, such methods enable the use of higher concentrations of nucleic acids in compartments compared to other methods of haplotyping, thereby increasing the efficiency and yields of various manipulations.

In some methods to obtain haplotype information with virtual compartments, a nucleic acid is compartmentalized into a plurality of first vessels, and the nucleic acids of each compartment are provided with a first index; the first-indexed nucleic acids are combined, and then compartmentalized into a plurality of second vessels, and the nucleic acids of each compartment are provided with a second index. A template nucleic acid can be prepared by undergoing at least 2, 3, 4, 5, 6, 7, 8, 9, 10 or more rounds of compartmentalizing, indexing, and pooling. In such a manner, a template nucleic acid is provided with a plurality of different indices in a stepwise method. Subsequent to indexing, the indexed template nucleic acids can be pooled and distributed into a plurality of compartments such that each compartment is likely to include an amount of a particular template nucleic acid having a particular combination of indexes that is equivalent to about a haploid equivalent of the target nucleic acid, or equivalent to less than about a haploid equivalent of the target nucleic acid, or equivalent to more than about a haploid equivalent. In other words, each vessel can receive an amount of template nucleic acid comprising more than a haploid equivalent; however, each copy of an allele or SNP is likely to be associated with a different combination of indexes. Accordingly, the number of vessels needed to compartmentalize a template nucleic acid such that each vessel includes about an amount of template nucleic acid equivalent to a haploid equivalent or less of a target nucleic acid can be reduced. In addition, the amount of nucleic acid in each vessel can be greater than the amount of about a haploid equivalent, thereby increasing the efficiency and yields of various manipulations.

There are various methods to index nucleic acids. For example, in some embodiments, indexes may be inserted into nucleic acids using transposomes provided herein; indexes can be ligated to nucleic acids; and indexes can be added to nucleic acids during copying, e.g., amplification of a nucleic acid. In some embodiments, a template nucleic acid comprising an index can be prepared using transposomes comprising a contiguous transposon sequence. See, e.g., transposome (50) in FIG. 1. Insertion of contiguous transposon sequences can result in the preservation of positional information for a particular nucleic acid molecule after distribution of template nucleic acids between several compartments. In some embodiments, a template nucleic acid comprising an index can be prepared using transposomes comprising non-contiguous transposon sequences. See, e.g., transposome (10) in FIG. 1. Examples of such transposon sequences are set forth in U.S. Patent Application Publication No. 2010/0120098, which is incorporated herein by reference in its entirety. Insertion of non-contiguous transposon sequences can result in the fragmentation of a particular nucleic acid molecule. Thus, in some embodiments, insertion of non-contiguous transposon sequences into a template nucleic acid can reduce positional information for a particular nucleic acid molecule after distribution of the template nucleic acids between several compartments. In other words, different fragments of a particular nucleic acid molecule can be distributed into different vessels.

In an example with a diploid genome, after pooling and dilution in compartments, a greater amount of nucleic acids can be added to each compartment since the chance of a copy from the father and a copy from the mother of the same region with the same indexes is lower. For example, one copy of the father and one copy of the mother for the same region can be present in the same compartment as long as each contains a different index, for example, one comes from a transposition reaction with a first index (index-1) and the other comes from a transposition reaction with a different first index (index-2). In other words, copies of the same region/chromosome can be present in the same compartment since these can be distinguished by their unique index incorporated in the first transposition reaction. This allows more DNA to be distributed into each compartment compared to alternative dilution methods. The dual indexing scheme creates a total number of virtual compartments equal to the number of initial indexed transposition reactions multiplied by the number of indexed PCR reactions.
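The likelihood that copies of the same region sharing a compartment remain distinguishable can be estimated with a short calculation. The following Python sketch is illustrative only and assumes that each copy independently carries one of the first indexes with equal probability, an idealization not stated in the specification. The function name is hypothetical.

```python
def prob_all_first_indexes_distinct(copies: int, n_first_indexes: int) -> float:
    """Probability that `copies` copies of one region in a single
    compartment all carry different first indexes.

    Assumption: each copy received its first index independently and
    uniformly at random from `n_first_indexes` transposition reactions.
    """
    if copies < 0 or n_first_indexes <= 0:
        raise ValueError("copies must be >= 0 and n_first_indexes > 0")
    if copies > n_first_indexes:
        return 0.0  # more copies than indexes: a repeat is unavoidable
    p = 1.0
    for already_used in range(copies):
        p *= (n_first_indexes - already_used) / n_first_indexes
    return p


if __name__ == "__main__":
    for k in (2, 4, 8):
        p = prob_all_first_indexes_distinct(k, 96)
        print(f"{k} copies, 96 first indexes: P(all distinct) = {p:.3f}")
```

With 96 first indexes, two copies of a region in the same compartment carry different first indexes with probability 95/96 under this model, which is why more DNA can be loaded per compartment than in a purely dilution-based scheme.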

FIG. 4 depicts an example embodiment of obtaining haplotype information using virtual compartments. A target nucleic acid comprising genomic DNA is distributed into a first set of 96 vessels and the nucleic acids of each vessel are provided with a different first index using a Tn5-derived transposon. Thus a plurality of first-indexed template nucleic acids is obtained (e.g., Tn5-1, Tn5-2 . . . , and Tn5-96). The plurality of first-indexed template nucleic acids are combined and then redistributed into a second set of 96 vessels and the nucleic acids of each vessel are provided with a different second index by amplification of the nucleic acids using primers comprising the second indexes. Thus a plurality of second-indexed template nucleic acids is obtained (e.g., PCR1, PCR2 . . . , and PCR96). The plurality of second-indexed template nucleic acids can be combined and sequence information obtained. The use of 96×96 physical vessels is equivalent to 9216 virtual compartments.
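The bookkeeping behind the virtual compartments of FIG. 4 can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's software: indexes are represented here as integers 0 to 95, and the function names and the read tuple layout are hypothetical.

```python
from collections import defaultdict
from itertools import product

N_FIRST = 96   # first-index (transposition) vessels, as in FIG. 4
N_SECOND = 96  # second-index (PCR) vessels, as in FIG. 4


def virtual_compartment_id(first_index: int, second_index: int,
                           n_second: int = N_SECOND) -> int:
    """Map a (first index, second index) pair to one compartment number."""
    return first_index * n_second + second_index


def group_reads(reads):
    """Group reads by virtual compartment.

    `reads` is an iterable of (read_name, first_index, second_index)
    tuples; this layout is an assumption made for the example.
    """
    groups = defaultdict(list)
    for read_name, first_index, second_index in reads:
        groups[virtual_compartment_id(first_index, second_index)].append(read_name)
    return dict(groups)


# Every index pair yields a distinct virtual compartment.
all_ids = {virtual_compartment_id(i, j)
           for i, j in product(range(N_FIRST), range(N_SECOND))}

if __name__ == "__main__":
    print(len(all_ids))  # 9216
    print(group_reads([("r1", 0, 0), ("r2", 0, 1), ("r3", 0, 0)]))
```

Reads sharing both indexes fall into the same virtual compartment and can be analyzed together, while reads differing in either index are kept apart.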

Methods to Obtain Extended Haplotype Information

As described above, insertion of non-contiguous transposon sequences into the template nucleic acid can reduce positional information for a particular nucleic acid molecule, for example, after distribution of the template nucleic acids between several compartments. However, applicant has discovered methods to preserve such positional information for a particular nucleic acid molecule. Without being bound to any one theory, it has been observed that after transposition, the resulting two adjacent fragments of a particular nucleic acid molecule will tend to be distributed into the same vessel under conditions that maintain the transposase at the site of insertion of a transposon sequence. In other words, the transposase may hold the two resulting adjacent fragments of a particular nucleic acid molecule together.

In some embodiments, a transposase can be removed from a template nucleic acid subsequent to distributing the template nucleic acid into several vessels. A transposase can be removed from the site of an insertion by various methods well known in the art, including the addition of a detergent, such as SDS, changing temperature, proteinase digestion, chaperone capture, and changing pH. DNA polymerases, with or without strand displacement properties, including, but not limited to, phi29 DNA polymerase, Bst DNA polymerase, etc. can also be used to dislodge the transposase from the DNA.

FIG. 5 depicts an example scheme in which a target nucleic acid is distributed into a set of first vessels and indexed by insertion of transposomes, such as transposomes comprising non-contiguous transposon sequences. The first indexed template nucleic acids are pooled and distributed into a set of second vessels and indexed by PCR amplification. Sequence information can be obtained from the second indexed template nucleic acids.

Some methods of obtaining extended haplotype information from a target nucleic acid include (a) obtaining a template nucleic acid comprising a plurality of transposomes inserted into the target nucleic acid, wherein at least some of the inserted transposomes each comprise a first transposon sequence, a second transposon sequence noncontiguous with the first transposon sequence, and a transposase associated with the first transposon sequence and the second transposon sequence; (b) compartmentalizing the template nucleic acid comprising the plurality of inserted transposomes into each vessel of a plurality of vessels; (c) removing the transposase from the template nucleic acid; and (d) obtaining sequence information from the template nucleic acid of each vessel, thereby obtaining haplotype information from the target nucleic acid. In some embodiments, compartmentalizing the template nucleic acid includes providing each vessel with an amount of template nucleic acid equivalent to greater than about a haploid equivalent of the target nucleic acid, an amount of template nucleic acid equivalent to about one haploid equivalent of the target nucleic acid, or an amount of template nucleic acid equivalent to less than about a haploid equivalent of the target nucleic acid.

An additional embodiment for maintaining contiguity of target nucleic acids for sequencing applications comprises utilizing one-sided (i.e., one transposon end) transpositional events in lieu of two-sided (i.e., two transposon ends) transpositional events as disclosed herein. For example, transposases including, but not limited to, Mu, MuE392Q mutant, and Tn5 have been shown to display one-sided transposition of a transposon sequence into a target nucleic acid (Haapa et al., 1999, Nucl. Acids Res. 27(3): 2777-2784). The one-sided transpositional mechanism of these transposases can be utilized in methods described here to maintain the contiguity of a sample for sequencing, for example to haplotype or assemble a target nucleic acid.

In one example of a transposome for one-sided transposition into a target DNA, a Tn5 dimer transposase is associated with only one transposon sequence end. In preferred embodiments, the transposon end could further comprise additional sequences such as index sequences, barcodes, and/or primer sequences and the like which could be used, for example, to identify a sample, amplify or extend the target nucleic acid, and align fragment sequences. The transposome complex associates with the target nucleic acid, in this case dsDNA. At the site of transposome association, the transposase cleaves one strand of the target DNA and inserts the transposon and any other additional sequences at the point of cleavage. The transposase remains associated with the target DNA until it is removed; for example, after partitioning of the sample as described herein, the transposase can be removed by degradation (e.g., use of SDS or other methods as described here). The target nucleic acid, in this case dsDNA, does not fragment after removal of the transposase; as such, the transposon and any additional sequences can be incorporated into the target DNA without fragmenting the DNA. Once the transposase is removed, target amplification by any means known in the art, in this example single or multiple primer amplification (due to incorporation of multiple different primer sequences included in one or more transposons), either exponential or linear, such as targeted PCR or whole genome amplification (for example by multiple strand displacement), can be performed to create libraries for sequencing. As described herein with respect to the two-sided transposon sequences, a variety of different combinations of index, barcode, restriction endonuclease site(s), and/or primer sequence could be included as part of the transposon sequence depending on the needs of the user. As such, one-sided transposome complexes could also be utilized to maintain contiguity of a target nucleic acid for methods disclosed herein for determining the haplotype of a target nucleic acid.

One-sided transposomes can also be created from the two transposon/transposase complexes or the looped transposon/transposase complexes disclosed herein. For example, one of the transposon sequences of a two transposon complex or one end of the looped transposon could be, for example, chemically modified or blocked so that transposition would not occur, or would minimally occur, at that end. For example, a dideoxynucleotide or a hapten such as biotin could be incorporated at the end of one of the transposon ends, which would inhibit transposition at that end, thereby allowing for only one transposon, or one end of a looped transposon, to be inserted into the target nucleic acid.

In one embodiment, a method of obtaining sequence information from a target nucleic acid comprises obtaining a template nucleic acid comprising a plurality of transposons inserted into said target nucleic acid such that the contiguity of the template is retained, compartmentalizing the nucleic acids comprising the plurality of inserted transposons into a plurality of vessels, generating compartment-specific indexed libraries from the transposed nucleic acid targets, and obtaining sequence information from the template nucleic acids in each vessel of the plurality of vessels.

Certain Methods for Preparing Target Nucleic Acids for Haplotyping

Some embodiments of the methods and compositions provided herein include preparing target nucleic acids for haplotyping using the methods provided herein. Using a pre-amplification method, the number of unique reads is increased by generating multiple identical copies of the same nucleic acid fragment as a contiguous product. In some such embodiments, a library is amplified by methods such as rolling circle amplification (RCA). In some embodiments, circular libraries of a target nucleic acid are prepared and the library amplified by RCA. Such methods generate extended long nucleic acids.

An example scheme is shown in FIG. 6. FIG. 6 depicts a method including preparing target nucleic acids for haplotyping by generating a library comprising circular molecules by mate-pair and selection of a specific size or range of sizes, such as 1-10 kb, 10-20 kb, 20-50 kb, or 50-200 kb nucleic acids; amplifying the library by RCA to generate extended long nucleic acids; inserting indexes into the amplified library with transposons; compartmentalizing the inserted library; removing transposase with SDS; further indexing the library; and obtaining sequence information from the library.

Another example scheme is shown in FIG. 7. FIG. 7 depicts a method including preparing target nucleic acids for haplotyping by generating a library comprising circular molecules by hairpin transposition, gap fill, and selection of a specific size or range of sizes, such as 1-10 kb, 10-20 kb, 20-50 kb, or 50-200 kb nucleic acids; amplifying the library by RCA to generate extended long nucleic acids; inserting indexes into the amplified library with transposons; compartmentalizing the inserted library; removing transposase with SDS; further indexing the library; and obtaining sequence information from the library.

Methods to Generate Mate-Paired Libraries

Methods for generating mate-pair libraries include: fragmenting genomic DNA into large fragments, typically greater than (though not limited to) 1000 bp; circularizing individual fragments by a method that tags the ligated junction; fragmenting the DNA further; enriching the tagged junction sequences; and ligating adaptors to the enriched junction sequences so that they may be sequenced, yielding information about the pair of sequences at the ends of the original long fragment of DNA. These processes involve at least 2 steps where DNA is fragmented, either physically or enzymatically. In one or more distinct steps, adaptors are ligated to the ends of fragments. Mate-pair preps typically take 2-3 days to perform and comprise multiple steps of DNA manipulations. The diversity of the resulting library correlates directly with the number of steps required to make the library.

The method provided herein reduces the number of steps in the library generation protocol by employing a transposase-mediated reaction that simultaneously fragments and adds adaptor sequences to the ends of the fragments. One or both of the fragmentation steps (initial fragmentation of genomic DNA and fragmentation of circularized fragments) may be performed with a transposome, thus replacing the need for separate fragmentation and adaptor ligation steps. Obviating the polishing, preparation of, and ligation to fragment ends reduces the number of process steps and thus increases the yield of usable data in the prep as well as making the procedure more robust. In one embodiment, the protocol can be performed without resorting to methods that purify a selection of sizes based on electrophoresis. This method produces a broader range of fragment sizes than can be achieved with gel electrophoretic methods but nonetheless produces usable data. The advantage is that a labour-intensive step is avoided.

FIG. 8 provides an example scheme where just the initial fragmentation is replaced with a transposome tagmentation step. The circularized DNA is fragmented by either physical methods or chemical/enzymatic methods and the fragments are turned into a library via application of standard sample prep protocols (e.g., TRUSEQ). FIG. 9 illustrates an example scheme where both the initial fragmentation and the circularized DNA fragmentation are performed with a transposome tagmentation step. The adaptor sequences for the transposome (including the ME sequences) may or may not differ between the initial tagmentation and the subsequent circle tagmentation.

Amplifying template nucleic acid by generating multiple copies of each molecule before transposition or introduction of molecular indexes creates redundancy which can be useful for getting higher SNP coverage in each haplotype block, and also for de novo genome assembly, similar to a shotgun approach. Template nucleic acid can be converted to a defined-size library by low-frequency transposition, physical shearing, or enzymatic digestion, and then amplified for a finite number of cycles by either PCR or a whole genome amplification scheme (for example using phi29). The amplified library, which already contains the built-in redundancy, can be used as the input material for the haplotyping workflow. This way, every region of the genome is represented multiple times by multiple copies generated upfront, with each copy contributing a partial coverage of that region; however, the consensus coverage will be closer to complete.
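The way partial coverage from several copies combines toward near-complete consensus coverage can be illustrated numerically. The following Python sketch assumes, purely for illustration, that each copy independently covers the same fraction of a region at random positions; real coverage is unlikely to be fully independent, so the result is an idealized upper-level estimate. The function name is hypothetical.

```python
def consensus_coverage(per_copy_coverage: float, copies: int) -> float:
    """Expected fraction of a region covered by at least one copy.

    Assumption: each of `copies` copies covers a random fraction
    `per_copy_coverage` of the region, independently of the others.
    """
    if not 0.0 <= per_copy_coverage <= 1.0:
        raise ValueError("per_copy_coverage must be between 0 and 1")
    if copies < 0:
        raise ValueError("copies must be non-negative")
    # A position is missed only if every copy misses it.
    return 1.0 - (1.0 - per_copy_coverage) ** copies


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        print(f"{k:>2} copies at 30% each: "
              f"consensus coverage = {consensus_coverage(0.3, k):.3f}")
```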

EXAMPLES

Example 1—Reducing Error Rates

A library of template nucleic acids was prepared with each fragment comprising a different barcode. Each fragment was amplified and sequence information was obtained from at least one amplified product from each fragment. A consensus sequence was determined from the sequence information from the amplified products from each fragment. In particular, a NEXTERA-prepped sequencing library was sequenced for 500 cycles on a MISEQ instrument. The library consisted of a distribution of sizes, with maximum read lengths extending to ~300 nt. The error rates at cycle 250 were approximately 15%. If a template was represented just three times, the error rate dropped to ~1% at cycle 250. FIG. 10 illustrates a model of error rates with number of amplified products sequenced (coverage of each barcode). Error rates decrease as the number of amplified products sequenced from a fragment increases.
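One way a consensus sequence could be derived from reads sharing a barcode is a per-position majority vote. The following Python sketch is offered as an illustration rather than the procedure used in this example; it assumes the reads for each barcode are already aligned to a common start, and it does not attempt to reproduce the error rates reported above. The function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads):
    """Build a majority-vote consensus sequence for each barcode.

    `reads` is an iterable of (barcode, sequence) pairs. Reads sharing a
    barcode are assumed to start at the same template position; positions
    beyond the shortest read are ignored. A base is called only when it is
    seen in a strict majority of reads, otherwise "N" is emitted.
    """
    by_barcode = defaultdict(list)
    for barcode, sequence in reads:
        by_barcode[barcode].append(sequence)

    consensus = {}
    for barcode, sequences in by_barcode.items():
        length = min(len(s) for s in sequences)
        calls = []
        for pos in range(length):
            base, count = Counter(s[pos] for s in sequences).most_common(1)[0]
            calls.append(base if 2 * count > len(sequences) else "N")
        consensus[barcode] = "".join(calls)
    return consensus


if __name__ == "__main__":
    example = [("BC1", "ACGT"), ("BC1", "ACGA"), ("BC1", "ACGT"), ("BC2", "TTTT")]
    print(consensus_by_barcode(example))
```

In the usage example, the single discordant base in the second BC1 read is outvoted by the other two reads, illustrating how additional amplified products per barcode can suppress isolated sequencing errors.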

Example 2—Coupling Transposon Sequences

This example illustrates methods to couple two transposon sequences together in various orientations including a 5′-5′ orientation, and a 3′-3′ orientation. In an exemplary method, aldehyde oxyamine is used to form linked oligos via oxime ether formation. An aldehyde modified oligo (either on the 5′ or the 3′ end) is combined with an oligo modified with an oxyamine on the 5′ end in reaction buffer and allowed to incubate for 2 hours. Final product can be isolated via PAGE purification, for example.

In another exemplary method, bisoxyamine coupling was performed. Aldehyde modified oligos were dimerized with a bis-oxyamine (e.g., dioxyamino butane) linker using locally high concentrations to force bi-substitution. 100-mer oligos were synthesized with an aldehyde on the 5′ end and purified. The bisoxyamine oligo was synthesized. A 1 mM solution of the bisoxyamine oligo was made in a low pH reaction buffer containing a catalyst (5 M urea, 100 mM aniline, 10 mM citrate, 150 mM NaCl, pH 5.6) and added to a 665 μM solution of aldehyde oligo in water. The entire volume of the solution was diluted 1:1 with reaction buffer, and allowed to incubate at room temperature for 2 hours. A titration of various aldehyde:bisoxyamine ratios showed dimerization at high bisoxyamine ratios. The most successful conditions were replicated with 3′ aldehyde oligos. FIG. 11 shows results of 5′-5′ bisoxyamine coupling reactions in which looped precursor transposons were observed in the indicated dimer band. Similar results were observed for 3′-3′ bisoxyamine coupling reactions.

Example 3—Monitoring Transposome Stability

Transposomes were prepared using long and short transposon sequences loaded onto transposase. The transposome products included: A (2 short sequences); B (long and short sequences); and C (2 long sequences). The relative amounts of each species of transposome were measured under various conditions, such as temperature, buffers, and ratios of transposon sequences to transposase. Generally, high NaCl or KCl salt increased exchange of transposon sequences between transposomes. Glutamate and acetate buffers eliminated or reduced exchange, with preferred concentrations of 100-600 mM. Optimum storage conditions were determined.

Example 4—Maintaining Template Contiguity

This example illustrates a method for maintaining contiguity information of a template nucleic acid prepared using transposomes comprising non-contiguous transposon sequences in which Tn5 transposase stays bound to the template DNA post-transposition. Target nucleic acid was contacted with transposomes comprising Tn5 transposase and non-contiguous transposon sequences. FIG. 12 shows that samples further treated with SDS appeared as a smear of various fragments of template nucleic acid; samples not treated with SDS showed retention of putative high molecular weight template nucleic acid. Thus, even though a nucleic acid may be fragmented, adjacent sequences may still be associated with one another by the transposase (as demonstrated by the Tn5 bound DNA left in the wells).

In still another exemplary method, a library of template nucleic acids was prepared using transposomes comprising non-contiguous transposon sequences with target nucleic acid comprising human Chromosome 22. FIG. 13 shows that haplotype blocks up to 100 kb were observed for samples in which transposase was removed by SDS post-dilution. Thus, by practicing methods as described herein, target nucleic acids can maintain target integrity when transposed, be diluted, and be transformed into sequencing libraries.

Example 5—Maintaining Template Contiguity

Target nucleic acids were tagmented with transposomes comprising non-contiguous transposon sequences (NEXTERA), diluted to the desired concentration, and then treated with SDS to remove the transposase enzyme before PCR. As a control, the same amount of input DNA was tagmented, treated with SDS first, and then diluted to the desired concentration. SDS treatment before dilution removes proximity information since the transposase enzyme dissociates from the target DNA with SDS, thereby fragmenting the target DNA. Two tagmentation reactions were set up on 50 ng of a Coriell gDNA. One reaction was stopped with 0.1% SDS and then diluted to 6 pg. The other reaction was first diluted to 6 pg and then stopped with 0.1% SDS. The entire reactions were used to set up a 30-cycle PCR reaction and sequenced on a Gene Analyzer platform (Illumina), according to the manufacturer's instructions. The reads were mapped to a human reference genome and the distance distribution was calculated and plotted.
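For orientation, the DNA masses in this example can be related to haploid genome equivalents with a short calculation. The following Python sketch is illustrative only; the haploid human genome size (about 3.1 × 10^9 bp) and the average mass of a base pair (about 650 g/mol) are textbook approximations assumed here, not values taken from the specification, and the function names are hypothetical.

```python
AVOGADRO = 6.02214076e23          # molecules per mole
MEAN_BP_MASS_G_PER_MOL = 650.0    # assumption: approximate mass of one base pair
HAPLOID_GENOME_BP = 3.1e9         # assumption: approximate human haploid genome size


def haploid_genome_mass_pg(genome_bp: float = HAPLOID_GENOME_BP) -> float:
    """Approximate mass in picograms of one haploid genome copy."""
    grams = genome_bp * MEAN_BP_MASS_G_PER_MOL / AVOGADRO
    return grams * 1e12  # grams to picograms


def haploid_equivalents(mass_pg: float) -> float:
    """Approximate number of haploid genome equivalents in `mass_pg` of DNA."""
    return mass_pg / haploid_genome_mass_pg()


if __name__ == "__main__":
    print(f"one haploid equivalent is about {haploid_genome_mass_pg():.2f} pg")
    for mass_pg in (6.0, 50_000.0):  # 6 pg and 50 ng, as used above
        print(f"{mass_pg:>8.0f} pg is about "
              f"{haploid_equivalents(mass_pg):,.1f} haploid equivalents")
```

Under these assumptions, the 6 pg input corresponds to somewhat less than two haploid equivalents, which is the regime in which proximity information from intact transposase-bound fragments would be expected to be visible.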

As shown in FIG. 14, in the SDS-PostDilution sample, the median distance shifted to smaller sizes and a large sub-population of proximally located reads became apparent. If there was any haplotyping, a pile-up of proximal reads was expected. The bi-modal distribution of the post-dilution sample demonstrated that there is an enrichment of proximal reads.

The distance distribution was a measure of sample size (i.e., the more unique reads, the shorter the distance). The distance histograms for the SDS-PreDilution and SDS-PostDilution samples are shown in FIG. 14. To correct for the difference in the number of unique reads, the Pre-Dilution sample was down-sampled to give the same number of unique mapped reads (664,741). A significant enrichment was observed for reads that are immediate neighbors (i.e., junctions). This was measured by looking at the “distance to next alignment” distribution and finding the reads whose distance to the next alignment is the sequence read length minus 9 (which corresponds to the 9 bp duplication caused by Tn5 at the insertion site). Such reads make up 10% of the data (FIG. 15) and, with employment of a single primer system for amplification of NEXTERA libraries, can be doubled. Also, implementation of a more conservative sample prep which allows less sample loss allows recovery of more junction data. The haplotype resolving power diminished when input DNA was increased. On the other hand, reducing the DNA input required more amplification, and therefore more PCR cycles, generating many PCR duplicates. Using individually barcoded Tn5 complexes allowed the tagmentation and subsequent dilutions to be carried out in separate compartments. Low levels of input from individually barcoded and tagmented material were combined to elevate the PCR input DNA amounts to the level that allowed more specific amplification with less waste of sequencing capacity on redundant reads. Accordingly, using sufficient barcoded complexes allowed phasing of the majority of the human genome. In order to increase the haplotype resolving power, barcoding was implemented at both the complex level and the PCR primer level. Such a combinatorial indexing scheme allows the use of very low input DNA from each individually barcoded complex into the PCR reaction, which would allow powerful haplotype resolution. Using only 40 indexing oligos (8+12=20 for NEXTERA complexes, which generates 8*12=96 individual complexes, and 8+12=20 for PCR primers, which would allow 8*12=96 additional indexes), 96*96=9216 virtual compartments were generated for the abovementioned haplotyping workflow. Using a modified sequencing recipe, all the data was sequenced on a HiSeq-2000. All 9216 possible barcode combinations were observed in the sequencing results.
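The junction measurement described above, in which the distance to the next alignment equals the read length minus the 9 bp Tn5 duplication, can be sketched as follows. This Python example is a simplified illustration and not the analysis pipeline used in this example: it assumes a fixed read length, alignment start coordinates on a single chromosome, and ignores strand. The function names are hypothetical.

```python
TN5_DUPLICATION_BP = 9  # target-site duplication caused by Tn5, per the text above


def distances_to_next_alignment(starts):
    """Distance from each alignment start to the next one, in sorted order."""
    ordered = sorted(starts)
    return [nxt - cur for cur, nxt in zip(ordered, ordered[1:])]


def junction_fraction(starts, read_length, duplication=TN5_DUPLICATION_BP):
    """Fraction of distances equal to read_length minus the duplication.

    Simplifying assumptions: all reads have length `read_length`, all
    alignments lie on one chromosome, and strand is ignored.
    """
    distances = distances_to_next_alignment(starts)
    if not distances:
        return 0.0
    target = read_length - duplication
    return sum(1 for d in distances if d == target) / len(distances)


if __name__ == "__main__":
    starts = [100, 141, 500, 900]  # made-up coordinates for illustration
    print(distances_to_next_alignment(starts))
    print(junction_fraction(starts, read_length=50))
```

In the made-up usage data, only the first pair of alignments is separated by the read length minus 9, so one of the three distances is counted as a junction.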

Example 6—Obtaining Haplotype Information with Mu

Transposomes comprising Mu were used to obtain haplotype information. 1 ng of genomic DNA was targeted with Mu-TSM in a 50 μl reaction volume with 1× TA buffer and 1, 2, 4, or 8 μl of 25 μM Mu-TSM complexes. Reactions were incubated at 37° C. for 2 hours. Samples were diluted to 1 pg/μl. For Mu inactivation, 10 μl of each sample containing either 1 pg or 5 pg total genomic DNA were prepared. SDS was added to a final concentration of 0.05%. Samples were incubated at 55° C. for 20 minutes. The whole sample was used to set up a 50 μl PCR reaction using NPM. PCR was done for 30 cycles. PCR samples were cleaned up with 0.6× SPRI and resuspended in 20 μl of re-suspension buffer. Sequencing information was obtained. FIG. 16 shows the pile-up of proximal reads observed at sub-haploid content using transposomes comprising Mu.

The term “comprising” as used herein is synonymous with “including,” “containing,” or “characterized by,” and is inclusive or open-ended and does not exclude additional, unrecited elements or method steps.

All numbers expressing quantities of ingredients, reaction conditions, and so forth used in the specification are to be understood as being modified in all instances by the term “about.” Accordingly, unless indicated to the contrary, the numerical parameters set forth herein are approximations that may vary depending upon the desired properties sought to be obtained. At the very least, and not as an attempt to limit the application of the doctrine of equivalents to the scope of any claims in any application claiming priority to the present application, each numerical parameter should be construed in light of the number of significant digits and ordinary rounding approaches.

The above description discloses several methods and materials of the present invention. This invention is susceptible to modifications in the methods and materials, as well as alterations in the fabrication methods and equipment. Such modifications will become apparent to those skilled in the art from a consideration of this disclosure or practice of the invention disclosed herein. Consequently, it is not intended that this invention be limited to the specific embodiments disclosed herein, but that it cover all modifications and alternatives coming within the true scope and spirit of the invention.

All references cited herein, including but not limited to published and unpublished applications, patents, and literature references, are incorporated herein by reference in their entirety and are hereby made a part of this specification. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.

What is claimed is:
 1. A method for obtaining sequence information for a single cell, comprising: generating indexed template nucleic acid fragments derived from template nucleic acid of single cells, wherein the indexed template nucleic acid fragments comprise a first index and wherein the first index is associated with the template nucleic acid derived from the single cell; combining the cells and distributing the cells containing the indexed template nucleic acid fragments into a plurality of vessels; providing a second index to at least a portion of the indexed template nucleic acid fragments in a vessel of the plurality of vessels to generate second indexed template nucleic acid fragments comprising the first index and the second index in the vessel; obtaining sequence data from the indexed template nucleic acid fragments; and assembling a sequence representation of the template nucleic acid from the sequence data.
 2. The method of claim 1, wherein the indexed template nucleic acid is prepared by undergoing at least 2, 3, 4, 5, 6, 7, 8, 9, 10 or more rounds of compartmentalizing and indexing.
 3. The method of claim 1, wherein generating the indexed template nucleic acid fragments comprises using transposomes comprising the first index to tagment the template nucleic acid.
 4. The method of claim 1, wherein generating the indexed template nucleic acid fragments comprises transposition, and where subsequently the index is introduced using ligation.
 5. The method of claim 1, wherein the indexed nucleic acid fragments of the cells are amplified with PCR.
 6. The method of claim 1, wherein the vessel of the plurality of vessels comprises at least 2 or more cells.
 7. The method of claim 1, wherein the sequence representation comprises haplotype information based at least in part on assembling sequence reads from indexed template nucleic acid fragments.
 8. The method of claim 1, wherein another vessel of the plurality of vessels comprises indexed template nucleic acid fragments comprising the first index and a second index.
 9. The method of claim 1, wherein the second index is associated with the vessel.
 10. A library of template nucleic acids, comprising: a tagmented template nucleic acid comprising a plurality of transposons inserted into a template nucleic acid using transposomes, wherein at least some of the transposomes each comprise a first transposon sequence and a second transposon sequence noncontiguous with said first transposon sequence and a transposase, and wherein the template nucleic acid is derived from a single cell.
 11. The library of claim 10, wherein the tagmented template nucleic acid from the cell is distributed into a vessel and assigned a first index.
 12. The library of claim 11, wherein the first index identifies the sample.
 13. The library of claim 12, wherein the first index sequence is introduced using ligation.
 14. The library of claim 12, wherein the index sequence is inserted into the template nucleic acid using transposomes.
 15. The library of claim 10, wherein the template nucleic acid is a rolling circle amplification product.
 16. The library of claim 10, wherein the template nucleic acid is distributed in a well, on a microparticle or bead, or microfluidic device.
 17. The library of claim 10, wherein the tagmented template nucleic acid is associated with or coupled to transposases of the transposomes.
 18. A method of obtaining sequence information from a single cell, comprising obtaining a template nucleic acid of a cell comprising a plurality of transposons inserted into said target nucleic acid such that the contiguity of the template is retained, compartmentalizing the nucleic acid of the cell comprising the plurality of inserted transposons into a plurality of vessels, generating compartment-specific indexed libraries from the transposed nucleic acid targets and obtaining single cell sequence information from the template nucleic acids from a plurality of vessels.
 19. The method of claim 18, wherein generating the compartment-specific index is performed using transposition or ligation.
 20. The method of claim 18, wherein prior to generating the indexed template nucleic acid fragments, the incorporation of a universal primer site using transposition or ligation is used.
 21. The method of claim 18, wherein generating the indexed template nucleic acids fragments comprises (a) compartmentalizing the target nucleic acid of cells into a plurality of first vessels; (b) providing a first index to the target nucleic acid of each first vessel, thereby obtaining a first indexed nucleic acid; (c) combining the first indexed nucleic acids; (d) compartmentalizing the first indexed template nucleic acids into a plurality of second vessels; and (e) providing a second index to the first indexed template nucleic of each second vessel, thereby obtaining a second indexed nucleic acid.
 22. The method of claim 21, wherein the indexed template nucleic acids are generated by undergoing at least 2, 3, 4, 5, 6, 7, 8, 9, 10 or more rounds of compartmentalizing and indexing.
 23. The method of claim 21, wherein providing the first index comprises contacting the template nucleic acid with a plurality of transposomes each comprising a transposase and a transposon sequence comprising the first index under conditions such that at least some of the transposon sequences insert into the template nucleic acid.
 24. The method of claim 21, wherein the first index is introduced using ligation.